OpenCV - Basic Operations - Performance Issue [in Mode: Release] - c++

I might have discovered a huge performance issue with OpenCV's own implementation of matrix multiplication / summation, and wanted to check with you whether I may be missing something:
In advance: All runs were done in (OpenCV's) Release Mode.
(a) I'll do 10 million times a matrix-vector multiplication with a 3-by-3 matrix and a 3-by-1 vector. The implementation follows the code: res = mat * vec;
(b) I'll do the same with my own implementation of accessing the elements individually and then doing the multiplication process using pointer-arithmetic. [basically just multiplying out the process and writing down the equations for each row for the result vector]
I tested these variants with the compiler flags -O0, -O1, -O2, -O3, -Ofast and for OpenCV 3.1 & 3.2.
The timings are done using chrono (high_resolution_clock) on Ubuntu 16.04.
In all cases the non-optimized method (b) outperforms the OpenCV method (a) by a factor of ~100 to ~1000.
How can that be the case? Shouldn't OpenCV be optimized for these kinds of procedures? Should I raise an issue on Github, or is there something I'm totally missing?
Code: [Ready to copy and test on your machine]
#include <chrono>
#include <iostream>
#include "opencv2/core/cvstd.hpp"
#include "opencv2/core.hpp"
#include "opencv2/imgproc.hpp"
#include "opencv2/highgui.hpp"
int main()
// 1. Setup:
std::vector<std::chrono::high_resolution_clock::time_point> timestamp_vec_start(2);
std::vector<std::chrono::high_resolution_clock::time_point> timestamp_vec_end(2);
std::vector<double> timestamp_vec_total(2);
cv::Mat test_mat = (cv::Mat_<float>(3,3) << 0.023, 232.33, 0.545,
22.22, 0.1123, 4.444,
0.012, 3.4521, 0.202);
cv::Mat test_vec = (cv::Mat_<float>(3,1) << 5.77,
cv::Mat result_1 = cv::Mat(3, 1, CV_32FC1);
cv::Mat result_2 = cv::Mat(3, 1, CV_32FC1);
cv::Mat temp_test_mat_results = cv::Mat(3, 3, CV_32FC1);
cv::Mat temp_test_vec_results = cv::Mat(3, 1, CV_32FC1);
auto ptr_test_mat_res_0 = temp_test_mat_results.ptr<float>(0);
auto ptr_test_mat_res_1 = temp_test_mat_results.ptr<float>(1);
auto ptr_test_mat_res_2 = temp_test_mat_results.ptr<float>(2);
auto ptr_test_vec_res_0 = temp_test_vec_results.ptr<float>(0);
auto ptr_test_vec_res_1 = temp_test_vec_results.ptr<float>(1);
auto ptr_test_vec_res_2 = temp_test_vec_results.ptr<float>(2);
auto ptr_res_0 = result_2.ptr<float>(0);
auto ptr_res_1 = result_2.ptr<float>(1);
auto ptr_res_2 = result_2.ptr<float>(2);
// 2. OpenCV Basic Matrix Operations:
timestamp_vec_start[0] = std::chrono::high_resolution_clock::now();
for(int i = 0; i < 10000000; ++i)
// factor of up to 5000 here:
// result_1 = (test_mat + test_mat + test_mat) * (test_vec + test_vec);
// factor of 30~100 here:
result_1 = test_mat * test_vec;
timestamp_vec_end[0] = std::chrono::high_resolution_clock::now();
timestamp_vec_total[0] = static_cast<double>(std::chrono::duration_cast<std::chrono::microseconds>(timestamp_vec_end[0] - timestamp_vec_start[0]).count());
// 3. Pixel-Wise Operations:
timestamp_vec_start[1] = std::chrono::high_resolution_clock::now();
for(int i = 0; i < 10000000; ++i)
{
    auto ptr_test_mat_0 = test_mat.ptr<float>(0);
    auto ptr_test_mat_1 = test_mat.ptr<float>(1);
    auto ptr_test_mat_2 = test_mat.ptr<float>(2);
    auto ptr_test_vec_0 = test_vec.ptr<float>(0);
    auto ptr_test_vec_1 = test_vec.ptr<float>(1);
    auto ptr_test_vec_2 = test_vec.ptr<float>(2);
    ptr_test_mat_res_0[0] = ptr_test_mat_0[0] + ptr_test_mat_0[0] + ptr_test_mat_0[0];
    ptr_test_mat_res_0[1] = ptr_test_mat_0[1] + ptr_test_mat_0[1] + ptr_test_mat_0[1];
    ptr_test_mat_res_0[2] = ptr_test_mat_0[2] + ptr_test_mat_0[2] + ptr_test_mat_0[2];
    ptr_test_mat_res_1[0] = ptr_test_mat_1[0] + ptr_test_mat_1[0] + ptr_test_mat_1[0];
    ptr_test_mat_res_1[1] = ptr_test_mat_1[1] + ptr_test_mat_1[1] + ptr_test_mat_1[1];
    ptr_test_mat_res_1[2] = ptr_test_mat_1[2] + ptr_test_mat_1[2] + ptr_test_mat_1[2];
    ptr_test_mat_res_2[0] = ptr_test_mat_2[0] + ptr_test_mat_2[0] + ptr_test_mat_2[0];
    ptr_test_mat_res_2[1] = ptr_test_mat_2[1] + ptr_test_mat_2[1] + ptr_test_mat_2[1];
    ptr_test_mat_res_2[2] = ptr_test_mat_2[2] + ptr_test_mat_2[2] + ptr_test_mat_2[2];
    ptr_test_vec_res_0[0] = ptr_test_vec_0[0] + ptr_test_vec_0[0];
    ptr_test_vec_res_1[0] = ptr_test_vec_1[0] + ptr_test_vec_1[0];
    ptr_test_vec_res_2[0] = ptr_test_vec_2[0] + ptr_test_vec_2[0];
    ptr_res_0[0] = ptr_test_mat_res_0[0]*ptr_test_vec_res_0[0] + ptr_test_mat_res_0[1]*ptr_test_vec_res_1[0] + ptr_test_mat_res_0[2]*ptr_test_vec_res_2[0];
    ptr_res_1[0] = ptr_test_mat_res_1[0]*ptr_test_vec_res_0[0] + ptr_test_mat_res_1[1]*ptr_test_vec_res_1[0] + ptr_test_mat_res_1[2]*ptr_test_vec_res_2[0];
    ptr_res_2[0] = ptr_test_mat_res_2[0]*ptr_test_vec_res_0[0] + ptr_test_mat_res_2[1]*ptr_test_vec_res_1[0] + ptr_test_mat_res_2[2]*ptr_test_vec_res_2[0];
}
timestamp_vec_end[1] = std::chrono::high_resolution_clock::now();
timestamp_vec_total[1] = static_cast<double>(std::chrono::duration_cast<std::chrono::microseconds>(timestamp_vec_end[1] - timestamp_vec_start[1]).count());
// 4. Printout Timing Results:
std::cout << "\n\nTimings:\n\n";
std::cout << "Time spent in OpenCV's implementation: " << timestamp_vec_total[0]/1000.0 << " ms.\n";
std::cout << "Time spent in element-wise implementation: " << timestamp_vec_total[1]/1000.0 << " ms.\n\n";
return 0;

OpenCV is not optimized for small matrix operations.
You can reduce your overhead a little by using cv::gemm, which avoids allocating a new matrix for the result inside the loop.
But if small matrix operations are a bottleneck for you I recommend using Eigen.
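To illustrate why fixed-size types tend to win for tiny matrices, here is a minimal sketch that uses only the standard library. The names Vec3f, Mat3f and mul are hypothetical (they are not OpenCV or Eigen API); the point is only that the sizes are known at compile time, so the compiler can unroll the whole product and nothing is allocated at run time.

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-size types: sizes are compile-time constants,
// so no heap allocation or runtime type dispatch is involved.
using Vec3f = std::array<float, 3>;
using Mat3f = std::array<Vec3f, 3>;  // row-major: m[row][col]

// Multiply a 3x3 matrix by a 3x1 vector.
inline Vec3f mul(const Mat3f& m, const Vec3f& v)
{
    Vec3f r{};
    for (std::size_t row = 0; row < 3; ++row) {
        r[row] = m[row][0] * v[0] + m[row][1] * v[1] + m[row][2] * v[2];
    }
    return r;
}
```

Eigen's fixed-size types and OpenCV's own cv::Matx family follow the same idea, so either is probably a better fit than cv::Mat when the dimensions are this small.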
Using a quick Eigen implementation like:
Eigen::Matrix3d mat;
mat << 0.023, 232.33, 0.545,
22.22, 0.1123, 4.444,
0.012, 3.4521, 0.202;
Eigen::Vector3d vec3;
vec3 << 5.77,
Eigen::Vector3d result_e;
for (int i = 0; i < 10000000; ++i)
result_e = (mat *3 ) * (vec3 *2);
gives me the following numbers with VS2015 (obviously the difference might be less dramatic in GCC or Clang):
Time spent in OpenCV's implementation: 2384.45 ms.
Time spent in element-wise implementation: 78.653 ms.
Time spent in Eigen implementation: 36.088 ms.


What should you store or append a batch of tensors to in C++ when using LibTorch?

In C++, when using LibTorch (The C++ version of PyTorch), what should you store a batch of tensors in? I'm running into the problem of not being able to reset the batch on the next step because C++ doesn't allow storing a new variable over an existing variable.
In my attempt my batch of tensors is one single 385x385 tensor. The batch size is 385. In a for loop I use torch::cat to concatenate 385 smaller 1D tensors, each 385 numbers long. (Maybe 'stack' or 'append' are better terms for what I'm doing, since they are stacked together picket-fence style rather than 'concatenated', but that's what I'm using.) Anyway, there is no problem with this shape. It seems to work fine for one forward and backward pass, but on the next pass the tensor becomes 770x385 instead of a 385x385 tensor holding the next 385 arrays of length 385. I hope I am painting a picture and not being too verbose.
The code.
Near the bottom I have the line all_step_obs = torch::tensor({}); to try to wipe out the contents of the tensor, AKA, the batch, but this gives me a Segmentation fault (core dumped). I guess for trying to access the tensor outside of the loop(?)
If I don't have this line I get a 770x385 tensor after the next step.
The model
#include "mujoco/mujoco.h"
struct Net : torch::nn::Module {
torch::Tensor action_high, action_low;
Net(torch::Tensor action_high, torch::Tensor action_low) : action_high(action_high), action_low(action_low){
// Construct and register two Linear submodules.
fc1 = torch::nn::Linear(385, 385);
fc2 = torch::nn::Linear(385, 385);
fc3 = torch::nn::Linear(385, 42);
// cholesky_layer = torch::nn::Linear(385, (42 * (42 + 1)) / 2);
cholesky_layer = torch::nn::Linear(385, 385);
// Implement the Net's algorithm.
torch::Tensor forward(torch::Tensor x) {
// Use one of many tensor manipulation functions.
x = torch::relu(fc1->forward(x));
x = torch::dropout(x, /*p=*/0.2, /*train=*/is_training());
x = torch::relu(fc2->forward(x));
auto mean_layer = fc3->forward(x);
auto mean = action_low + (action_high - action_low) * mean_layer;
auto chol_l = cholesky_layer->forward(x);
// auto chol = torch::rand({385, 385});
auto chol = torch::matmul(chol_l, chol_l.transpose(0, 1));
chol = torch::nan_to_num(chol, 0, 2.0);
chol = chol.add(torch::eye(385));
auto cholesky = torch::linalg::cholesky(chol);
// return torch::cat({mean, cholesky}, 0);
return mean_layer;
// Use one of many "standard library" modules.
torch::nn::Linear fc1{nullptr}, fc2{nullptr}, fc3{nullptr}, cholesky_layer{nullptr};
The training
auto high = torch::ones({385, 42}) * 0.4;
auto low = torch::ones({385, 42}) * -0.4;
auto actor = Net(low, high);
int max_steps = 385;
int steps = 2000;
auto l1_loss = torch::smooth_l1_loss;
auto optimizer = torch::optim::Adam(actor.parameters(), 3e-4);
torch::Tensor train() {
torch::Tensor all_step_obs;
for (int i = 0; i<steps; ++i)
for (int i = 0; i<max_steps; ++i)
all_step_obs = torch::cat({torch::rand({385}).unsqueeze(0), all_step_obs});
auto mean = actor.forward(all_step_obs);
auto loss = l1_loss(mean, torch::rand({385, 42}), 1, 0);
all_step_obs = torch::tensor({});
if (steps == 1999) {
return loss;
int main (int argc, const char** argv) {
std::cout << train();
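
The growth from 385x385 to 770x385 is what one would expect when the accumulator lives outside the outer loop and is never emptied between steps. As a minimal sketch of the alternative, using only the standard library: keep the per-step container local to one batch, so each batch starts empty. Row and collect_batch are hypothetical names, and the zero-filled rows stand in for real observations. With LibTorch, the analogous approach would presumably be to collect the rows in a std::vector<torch::Tensor> and pass it to torch::stack, but that part is an assumption and is not tested here.

```cpp
#include <cstddef>
#include <vector>

// One observation: a row of `width` values. Stands in for a 1-D tensor.
using Row = std::vector<float>;

// Build one batch of `batch_size` rows. The container is local, so every
// call starts from an empty batch instead of growing the previous one.
inline std::vector<Row> collect_batch(std::size_t batch_size, std::size_t width)
{
    std::vector<Row> batch;
    batch.reserve(batch_size);
    for (std::size_t i = 0; i < batch_size; ++i) {
        batch.emplace_back(width, 0.0f);  // placeholder observation
    }
    return batch;
}
```

Calling collect_batch(385, 385) once per training step yields 385 rows every time, regardless of how many steps came before.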

How to send an image(cv::Mat) from C++ to Python using Pybind11? [duplicate]

I have a c++ application that sends data through to a python function over shared memory.
This works great using ctypes in Python such as doubles and floats. Now, I need to add a cv::Mat to the function.
My code currently is:
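#include <cstring>
#include <iostream>
#include <opencv2/core.hpp>
#include <opencv2/highgui.hpp>
#include <pybind11/embed.h>
#include <boost/interprocess/windows_shared_memory.hpp>
#include <boost/interprocess/mapped_region.hpp>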
struct TransferData
double score;
float other;
int num;
int w;
int h;
int channels;
uchar* data;
#define C_OFF 1000
void fill(TransferData* data, int run, uchar* frame, int w, int h, int channels)
data->score = C_OFF + 1.0;
data->other = C_OFF + 2.0;
data->num = C_OFF + 3;
data->w = w;
data->h = h;
data->channels = channels;
data->data = frame;
namespace py = pybind11;
using namespace boost::interprocess;
void main()
//python setup
py::scoped_interpreter guard{};
py::module py_test = py::module::import("Transfer_py");
// Create Data
windows_shared_memory shmem(create_only, "TransferDataSHMEM",
read_write, sizeof(TransferData));
mapped_region region(shmem, read_write);
std::memset(region.get_address(), 0, sizeof(TransferData));
TransferData* data = reinterpret_cast<TransferData*>(region.get_address());
for (int i = 0; i < 10; i++)
int64 t0 = cv::getTickCount();
std::cout << "C++ Program - Filling Data" << std::endl;
cv::Mat frame = cv::imread("input.jpg");
fill(data, i, frame.data, frame.cols, frame.rows, frame.channels());
//run the python function
py::object result = py_test.attr("datathrough")();
int64 t1 = cv::getTickCount();
double secs = (t1 - t0) / cv::getTickFrequency();
std::cout << "took " << secs * 1000 << " ms" << std::endl;
//transfer data class
import ctypes

class TransferData(ctypes.Structure):
    _fields_ = [
        ('score', ctypes.c_double),
        ('other', ctypes.c_float),
        ('num', ctypes.c_int),
        ('w', ctypes.c_int),
        ('h', ctypes.c_int),
        ('frame', ctypes.c_void_p),
        ('channels', ctypes.c_int)
    ]

PY_OFF = 2000

def fill(data):
    data.score = PY_OFF + 1.0
    data.other = PY_OFF + 2.0
    data.num = PY_OFF + 3
//main Python function
import TransferData
import sys
import mmap
import ctypes

def datathrough():
    shmem = mmap.mmap(-1, ctypes.sizeof(TransferData.TransferData), "TransferDataSHMEM")
    data = TransferData.TransferData.from_buffer(shmem)
    print('Python Program - Getting Data')
    print('Python Program - Filling Data')
How can I add the cv::Mat frame data into the Python side? I am sending it as a uchar* from c++, and as i understand, I need it to be a numpy array to get a cv2.Mat in Python. What is the correct approach here to go from 'width, height, channels, frameData' to an opencv python cv2.Mat?
I am using shared memory because speed is a factor, I have tested using the Python API approach, and it is much too slow for my needs.
The general idea (as used in the OpenCV Python bindings) is to create a numpy ndarray that shares its data buffer with the Mat object, and pass that to the Python function.
Note: At this point, I'll limit the example to continuous matrices only.
We can take advantage of the pybind11::array class.
We need to determine the appropriate dtype for the numpy array to use. This is a simple 1-to-1 mapping, which we can do using a switch:
py::dtype determine_np_dtype(int depth)
{
    switch (depth) {
    case CV_8U: return py::dtype::of<uint8_t>();
    case CV_8S: return py::dtype::of<int8_t>();
    case CV_16U: return py::dtype::of<uint16_t>();
    case CV_16S: return py::dtype::of<int16_t>();
    case CV_32S: return py::dtype::of<int32_t>();
    case CV_32F: return py::dtype::of<float>();
    case CV_64F: return py::dtype::of<double>();
    default:
        throw std::invalid_argument("Unsupported data type.");
    }
}
Determine the shape for the numpy array. To make this behave similarly to OpenCV, let's have it map 1-channel Mats to 2D numpy arrays, and multi-channel Mats to 3D numpy arrays.
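std::vector<std::size_t> determine_shape(cv::Mat& m)
{
    // Single-channel Mats map to 2D arrays (rows, cols)
    if (m.channels() == 1) {
        return {
            static_cast<size_t>(m.rows)
            , static_cast<size_t>(m.cols)
        };
    }
    // Multi-channel Mats map to 3D arrays (rows, cols, channels)
    return {
        static_cast<size_t>(m.rows)
        , static_cast<size_t>(m.cols)
        , static_cast<size_t>(m.channels())
    };
}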
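
As a side note on why only continuous matrices are handled: a numpy array is described by its shape plus byte strides, and for a continuous buffer the strides follow directly from the shape. The sketch below shows that relationship using only the standard library. The names image_shape and c_strides are hypothetical helpers for illustration, not pybind11 or OpenCV API. As far as I know, pybind11 assumes this same C-order layout when no strides are passed to py::array.

```cpp
#include <cstddef>
#include <vector>

// Shape of a continuous image buffer: 2-D for one channel, 3-D otherwise.
inline std::vector<std::size_t> image_shape(std::size_t rows, std::size_t cols, std::size_t channels)
{
    if (channels == 1) return {rows, cols};
    return {rows, cols, channels};
}

// Byte strides for a row-major (C-order) buffer with the given shape.
inline std::vector<std::size_t> c_strides(const std::vector<std::size_t>& shape, std::size_t item_size)
{
    // The last axis advances by one element; each earlier axis advances by
    // the full extent of the axes after it.
    std::vector<std::size_t> strides(shape.size(), item_size);
    for (std::size_t i = shape.size(); i-- > 1;) {
        strides[i - 1] = strides[i] * shape[i];
    }
    return strides;
}
```

For a 1024x1024 CV_8UC3 image this gives shape (1024, 1024, 3) and strides (3072, 3, 1). A non-continuous Mat has padding at the end of each row, so its row stride would have to come from the Mat itself instead.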
Provide means of extending the shared buffer's lifetime to the lifetime of the numpy array. We can create a pybind11::capsule around a shallow copy of the source Mat -- due to the way the object is implemented, this effectively increases its reference count for the required amount of time.
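py::capsule make_capsule(cv::Mat& m)
{
    // The heap-allocated shallow copy keeps the pixel buffer alive
    // until the capsule's destructor deletes it.
    return py::capsule(new cv::Mat(m)
        , [](void *v) { delete reinterpret_cast<cv::Mat*>(v); }
        );
}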
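
The lifetime trick can be seen in isolation with a reference-counted buffer from the standard library. This is only an analogy under stated assumptions: Image, Owner and make_owner are hypothetical names, std::shared_ptr stands in for the reference counting inside cv::Mat, and the type-erased Owner plays the role of the capsule.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for a reference-counted image: copies share one buffer.
struct Image {
    std::shared_ptr<std::vector<std::uint8_t>> pixels;
};

// Deleter that knows the real type behind the void pointer.
inline void delete_image(void* p) { delete static_cast<Image*>(p); }

// Type-erased owner with a deleter, playing the role of the capsule.
using Owner = std::unique_ptr<void, void (*)(void*)>;

// Heap-allocate a shallow copy of `img`; the shared buffer stays alive
// until the returned owner is destroyed and its deleter runs.
inline Owner make_owner(const Image& img)
{
    return Owner(new Image(img), &delete_image);
}
```

While an Owner exists, the buffer's use count is one higher than before, which is the effect the capsule has on the Mat's data.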
Now, we can perform the conversion.
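py::array mat_to_nparray(cv::Mat& m)
{
    if (!m.isContinuous()) {
        throw std::invalid_argument("Only continuous Mats supported.");
    }
    // Pass the Mat's data pointer so the array shares the buffer,
    // and the capsule as base object so the buffer outlives the array.
    return py::array(determine_np_dtype(m.depth())
        , determine_shape(m)
        , m.data
        , make_capsule(m));
}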
Let's assume, we have a Python function like
def foo(arr):
captured in a pybind object fun. Then to call this function from C++ using a Mat as a source we'd do something like this:
cv::Mat img; // Initialize this somehow
auto result = fun(mat_to_nparray(img));
Sample Program
#include <pybind11/pybind11.h>
#include <pybind11/embed.h>
#include <pybind11/numpy.h>
#include <pybind11/stl.h>
#include <opencv2/opencv.hpp>
#include <iostream>
namespace py = pybind11;
// The 4 functions from above go here...
int main()
{
    // Start the interpreter and keep it alive
    py::scoped_interpreter guard{};

    try {
        auto locals = py::dict{};

        // Define the Python function that will receive the array
        py::exec(R"(
            import numpy as np

            def test_cpp_to_py(arr):
                return (arr[0,0,0], 2.0, 30)
        )");

        auto test_cpp_to_py = py::globals()["test_cpp_to_py"];

        for (int i = 0; i < 10; i++) {
            int64 t0 = cv::getTickCount();
            cv::Mat img(cv::Mat::zeros(1024, 1024, CV_8UC3) + cv::Scalar(1, 1, 1));
            int64 t1 = cv::getTickCount();
            auto result = test_cpp_to_py(mat_to_nparray(img));
            int64 t2 = cv::getTickCount();
            double delta0 = (t1 - t0) / cv::getTickFrequency() * 1000;
            double delta1 = (t2 - t1) / cv::getTickFrequency() * 1000;
            std::cout << "* " << delta0 << " ms | " << delta1 << " ms" << std::endl;
        }
    } catch (py::error_already_set& e) {
        std::cerr << e.what() << "\n";
    }

    return 0;
}
Console Output
* 4.56413 ms | 0.225657 ms
* 3.95923 ms | 0.0736127 ms
* 3.80335 ms | 0.0438603 ms
* 3.99262 ms | 0.0577587 ms
* 3.82262 ms | 0.0572 ms
* 3.72373 ms | 0.0394603 ms
* 3.74014 ms | 0.0405079 ms
* 3.80621 ms | 0.054546 ms
* 3.72177 ms | 0.0386222 ms
* 3.70683 ms | 0.0373651 ms
I truly liked the answer of Dan Mašek! Based on these insights, I built a small library that provides:
Explicit transformers between cv::Mat and numpy.ndarray with shared memory (as was shown in his answer)
Explicit transformers between cv::Matx and numpy.ndarray, with shared memory
It also provides automatic casts:
Casts with shared memory between cv::Mat, cv::Matx, cv::Vec and numpy.ndarray
Casts without shared memory for simple types, between cv::Size, cv::Point, cv::Point3 and python tuple

openCV cv::matchTemplate running twice slower on a "better/newer" intel cpu

I am using cv::matchTemplate to track a moving object in a video.
However, running OpenCV's template matching with a small picture can be slower on a better/newer Intel CPU. The code snippet below typically runs 2 times slower on an i9-7920x (0.28ms/match) than on an i7-9700k (0.14ms/match).
#include <chrono>
#include <fstream>
#include <opencv2/opencv.hpp>
#pragma optimize("", off)
int main()
{
    cv::Mat haystack;
    cv::Mat needle;
    cv::Mat result;
    cv::Rect rect;
    haystack = cv::imread("C:/President_Barack_Obama.jpg");
    // Crop a 64x64 search area
    rect.width = 64;
    rect.height = 64;
    haystack = haystack(rect);
    // Cut a 12x12 template out of the search area
    rect.width = 12;
    rect.height = 12;
    rect.x = 50;
    rect.y = 50;
    needle = haystack(rect);
    auto start = std::chrono::high_resolution_clock::now();
    int nbmatch = 10000;
    for (int i = 0; i < nbmatch; i++) {
        cv::matchTemplate(haystack, needle, result, cv::TemplateMatchModes::TM_CCOEFF_NORMED);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> diff = end - start;
    std::cout << "time per match: " << (diff.count() / nbmatch) * 1000 << " ms\n";
}
In my real application, I noticed this:
i7-9700k: 1ms;
i7-6800k: 1.3ms;
i9-7920x: 2.8ms;
i9-9820x: 2.8ms.
Both i9s are slower by a margin that the slight difference in clock speed cannot explain.
Windows 7 or 10 makes no difference. The code is compiled with Visual Studio 2019 (v142). OpenCV comes from the pre-built libraries (building it from source myself did not help).
The ability to scale the CPU frequency seems to have an important impact. Running single-threaded, the i9-7920x still takes 2.8ms if I sleep regularly, but if I yield instead (CPU load of 100%) the time drops to 1.9ms.
What could explain this?
Do you think it is possible to get all processors to compute in the same time range using cv::matchTemplate?
What else could I do to reduce my computation time?
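
Since sleeping versus yielding changes the numbers, part of the gap may come from clock ramp-up rather than from the matching itself. One way to probe that is a harness that warms up first and then reports the fastest of several timed batches. Here is a minimal sketch using only the standard library; best_ms_per_call and its parameters are hypothetical names, and the callable would wrap the cv::matchTemplate call.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Run `work` a few times untimed (warm-up), then time `batches` batches of
// `iters` calls each and return the fastest per-call time in milliseconds.
template <typename F>
double best_ms_per_call(F&& work, int warmup, int batches, int iters)
{
    using clock = std::chrono::steady_clock;
    for (int i = 0; i < warmup; ++i) work();

    double best = std::numeric_limits<double>::infinity();
    for (int b = 0; b < batches; ++b) {
        const auto start = clock::now();
        for (int i = 0; i < iters; ++i) work();
        const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
        best = std::min(best, elapsed.count() / iters);
    }
    return best;
}
```

If the best-of-batches time is similar across machines while the average is not, that would point towards frequency scaling or scheduling rather than the algorithm.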

Error in gauss-newton implementation for pose optimization

I’m using a modified version of a gauss-newton method to refine a pose estimate using OpenCV. The unmodified code can be found here:
The details of this approach are outlined in the corresponding paper:
Marchand, Eric, Hideaki Uchiyama, and Fabien Spindler. "Pose
estimation for augmented reality: a hands-on survey." IEEE
transactions on visualization and computer graphics 22.12 (2016):
A PDF can be found here:
The part that is relevant (Pages 4 and 5) are screencapped below:
Here is what I have done. First, I’ve (hopefully) “corrected” some errors: (a) dt and dR can be passed by reference to exponential_map() (even though cv::Mat is essentially a pointer). (b) The last entry of each 2x6 Jacobian matrix, J.at<double>(i*2+1,5), was -x[i].y but should be -x[i].x. (c) I’ve also tried using a different formula for the projection. Specifically, one that includes the focal length and principal point: xq.at<double>(i*2,0) = cx + fx * cX.at<double>(0,0) / cX.at<double>(2,0); xq.at<double>(i*2+1,0) = cy + fy * cX.at<double>(1,0) / cX.at<double>(2,0);
Here is the relevant code I am using, in its entirety (control starts at optimizePose3()):
void exponential_map(const cv::Mat &v, cv::Mat &dt, cv::Mat &dR)
{
    // v = (translational part, rotational part theta*u)
    double vx = v.at<double>(0,0);
    double vy = v.at<double>(1,0);
    double vz = v.at<double>(2,0);
    double vtux = v.at<double>(3,0);
    double vtuy = v.at<double>(4,0);
    double vtuz = v.at<double>(5,0);
    cv::Mat tu = (cv::Mat_<double>(3,1) << vtux, vtuy, vtuz); // theta u
    cv::Rodrigues(tu, dR);
    double theta = sqrt(tu.dot(tu));
    double sinc = (fabs(theta) < 1.0e-8) ? 1.0 : sin(theta) / theta;
    double mcosc = (fabs(theta) < 2.5e-4) ? 0.5 : (1.-cos(theta)) / theta / theta;
    double msinc = (fabs(theta) < 2.5e-4) ? (1./6.) : (1.-sin(theta)/theta) / theta / theta;
    dt.at<double>(0,0) = vx*(sinc + vtux*vtux*msinc)
                       + vy*(vtux*vtuy*msinc - vtuz*mcosc)
                       + vz*(vtux*vtuz*msinc + vtuy*mcosc);
    dt.at<double>(1,0) = vx*(vtux*vtuy*msinc + vtuz*mcosc)
                       + vy*(sinc + vtuy*vtuy*msinc)
                       + vz*(vtuy*vtuz*msinc - vtux*mcosc);
    dt.at<double>(2,0) = vx*(vtux*vtuz*msinc - vtuy*mcosc)
                       + vy*(vtuy*vtuz*msinc + vtux*mcosc)
                       + vz*(sinc + vtuz*vtuz*msinc);
}
void optimizePose3(const PoseEstimation &pose,
std::vector<FeatureMatch> &feature_matches,
PoseEstimation &optimized_pose) {
//Set camera parameters
double fx =<double>(0, 0); //Focal length
double fy =<double>(1, 1);
double cx =<double>(0, 2); //Principal point
double cy =<double>(1, 2);
auto inlier_matches = getInliers(pose, feature_matches);
std::vector<cv::Point3d> wX;
std::vector<cv::Point2d> x;
const unsigned int npoints = inlier_matches.size();
cv::Mat J(2*npoints, 6, CV_64F);
double lambda = 0.25;
cv::Mat xq(npoints*2, 1, CV_64F);
cv::Mat xn(npoints*2, 1, CV_64F);
double residual=0, residual_prev;
cv::Mat Jp;
for(auto i = 0u; i < npoints; i++) {
//Model points
const cv::Point2d &M = inlier_matches[i].model_point();
wX.emplace_back(M.x, M.y, 0.0);
//Imaged points
const cv::Point2d &I = inlier_matches[i].image_point();
xn.at<double>(i*2,0) = I.x; // x
xn.at<double>(i*2+1,0) = I.y; // y
}
//Initial estimation
cv::Mat cRw = pose.rotation_matrix;
cv::Mat ctw = pose.translation_vector;
int nIters = 0;
// Iterative Gauss-Newton minimization loop
do {
for (auto i = 0u; i < npoints; i++) {
cv::Mat cX = cRw * cv::Mat(wX[i]) + ctw; // Update cX, cY, cZ
// Update x(q)
//xq.at<double>(i*2,0) = cX.at<double>(0,0) / cX.at<double>(2,0); // x(q) = cX/cZ
//xq.at<double>(i*2+1,0) = cX.at<double>(1,0) / cX.at<double>(2,0); // y(q) = cY/cZ
xq.at<double>(i*2,0) = cx + fx * cX.at<double>(0,0) / cX.at<double>(2,0);
xq.at<double>(i*2+1,0) = cy + fy * cX.at<double>(1,0) / cX.at<double>(2,0);
// Update J using equation (11)
J.at<double>(i*2,0) = -1 / cX.at<double>(2,0); // -1/cZ
J.at<double>(i*2,1) = 0;
J.at<double>(i*2,2) = x[i].x / cX.at<double>(2,0); // x/cZ
J.at<double>(i*2,3) = x[i].x * x[i].y; // xy
J.at<double>(i*2,4) = -(1 + x[i].x * x[i].x); // -(1+x^2)
J.at<double>(i*2,5) = x[i].y; // y
J.at<double>(i*2+1,0) = 0;
J.at<double>(i*2+1,1) = -1 / cX.at<double>(2,0); // -1/cZ
J.at<double>(i*2+1,2) = x[i].y / cX.at<double>(2,0); // y/cZ
J.at<double>(i*2+1,3) = 1 + x[i].y * x[i].y; // 1+y^2
J.at<double>(i*2+1,4) = -x[i].x * x[i].y; // -xy
J.at<double>(i*2+1,5) = -x[i].x; // -x
}
cv::Mat e_q = xq - xn; // Equation (7)
cv::Mat Jp = J.inv(cv::DECOMP_SVD); // Compute pseudo inverse of the Jacobian
cv::Mat dq = -lambda * Jp * e_q; // Equation (10)
cv::Mat dctw(3, 1, CV_64F), dcRw(3, 3, CV_64F);
exponential_map(dq, dctw, dcRw);
cRw = dcRw.t() * cRw; // Update the pose
ctw = dcRw.t() * (ctw - dctw);
residual_prev = residual; // Memorize previous residual
residual = e_q.dot(e_q); // Compute the actual residual
std::cout << "residual_prev: " << residual_prev << std::endl;
std::cout << "residual: " << residual << std::endl << std::endl;
} while (fabs(residual - residual_prev) > 0);
//} while (nIters < 30);
optimized_pose.rotation_matrix = cRw;
optimized_pose.translation_vector = ctw;
cv::Rodrigues(optimized_pose.rotation_matrix, optimized_pose.rotation_vector);
}
Even when I use the functions as given, it does not produce the correct results. My initial pose estimate is very close to optimal, but when I try run the program, the method takes a very long time to converge - and when it does, the results are very wrong. I’m not sure what could be wrong and I’m out of ideas. I’m confident my inliers are actually inliers (they were chosen using an M-estimator). I’ve compared the results from exponential map with those from other implementations, and they seem to agree.
So, where is the error in this gauss-newton implementation for pose optimization? I’ve tried to make things as easy as possible for anyone willing to lend a hand. Let me know if there is anymore information I can provide. Any help would be greatly appreciated. Thanks.
Edit: 2019/05/13
There is now solvePnPRefineVVS function in OpenCV.
Also, you should use x and y calculated from the current estimated pose instead.
In the cited paper, they expressed the measurements x in the normalized camera frame (at z=1).
When working with real data, you have:
(u,v): 2D image coordinates (e.g. keypoints, corner locations, etc.)
K: the intrinsic parameters (obtained after calibrating the camera)
D: the distortion coefficients (obtained after calibrating the camera)
To compute the 2D image coordinates in the normalized camera frame, you can use in OpenCV the function cv::undistortPoints() (link to my answer about cv::projectPoints() and cv::undistortPoints()).
When there is no distortion, the computation (also called "reverse perspective transformation") is:
x = (u - cx) / fx
y = (v - cy) / fy

Efficiency of summing images using MATLAB and OpenCV

I am totally surprised by all of your answers. Thank you very much!
The buggy code was the following:
percentage = (double)kk * 100.0 / (double)totalnum;
After I modified it to:
percentage = (double)kk * 100.0 / totalnum;
The problem is solved. This simple division consumed about 90 s of the 150 s. Maybe division between a double and an int is faster than division between two doubles.
Again, thanks for all of your answers!
I'm trying to get the average image from a set of pictures that come from a video. There are only 2 steps for this job:
Sum up all the images into a matrix.
Divide the matrix by the number of images.
I used following code in OpenCV: (C++)
Mat avIM = Mat::zeros(IMG_HEIGHT, IMG_WIDTH, CV_32FC3);
for (ii = startnum; ii <= endnum; ii += interval) {
string fullname = argv[1];
sprintf(filename, "\\%d.png", ii);
Mat tempIM = imread(fullname.c_str());
if (tempIM.empty()) { cout << "Can't open image!\n"; return -1; }
tempIM.convertTo(tempIM, CV_32FC3);
avIM += tempIM; //Sum up every image
}
avIM = avIM * (double)(1.0 / kk); //get average
And following code in MatLab: (2015a)
avIM = zeros(size(imread([im.dir,'\',num2str(startnum),'.png'])));
pointIdx = startnum:interval:endnum;
for j=pointIdx,
IM = imread([im.dir,'\',num2str(j),'.png']);
avIM = avIM + double(IM); %Sum up every image
avIM = uint8(round(avIM./size(pointIdx,2))); %get average
But when I ran those two programs on 2,100 images, OpenCV took 150.3 s (Release) and MATLAB took 103.1 s. It really confused me that a C++ program runs slower than a MATLAB script.
So what's slowing down my OpenCV program? If it's caused by my method of matrix accessing, what should I do to improve the efficiency?
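
For reference, the two steps (sum, then divide) can be sketched without any library, which makes it easy to see that the arithmetic itself is cheap. This is only an illustration under stated assumptions: Image8 and average_images are hypothetical names, images are flattened 8-bit buffers, and all images are assumed to have the same size.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

using Image8 = std::vector<std::uint8_t>;  // flattened pixels, hypothetical type

// Average equally sized 8-bit images: sum into doubles, divide once, round.
// Assumes every image has the same number of elements as the first one.
inline Image8 average_images(const std::vector<Image8>& images)
{
    if (images.empty()) return {};
    std::vector<double> sum(images.front().size(), 0.0);
    for (const Image8& img : images) {
        for (std::size_t i = 0; i < sum.size(); ++i) sum[i] += img[i];
    }
    Image8 avg(sum.size());
    const double n = static_cast<double>(images.size());
    for (std::size_t i = 0; i < sum.size(); ++i) {
        avg[i] = static_cast<std::uint8_t>(std::lround(sum[i] / n));
    }
    return avg;
}
```

With the summation this simple, most of the running time in either program is likely to be spent reading and decoding the files, so it is worth timing that part separately.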
Your code seems good enough, and in my tests it ran 10 times faster than the Matlab code.
However, I show a slightly optimized version that performs a little faster than yours.
Please note that I don't have a folder with images named like yours, so I used cv::glob in the C++ version, and dir in the Matlab version, to get the names of the images in the folder.
In my folder I have 82 small images, so the running time is obviously smaller than yours, but the relative performance should be reliable.
Execution time
                      Sum only       (Get filenames + Sum)
Matlab:               0.173543 s     (0.185308 s)
OpenCV #Seven Wang:   0.0145206 s    (0.0155748 s)
OpenCV #Miki:         0.0128943 s    (0.013333 s)
Be sure that you're computing the running time consistently in OpenCV and Matlab.
Matlab code:
folder = 'D:\\SO\\temp\\old_075_6\\';
filenames = dir([folder '*.bmp']);
% Get rows and cols from 1st image
name = filenames(1).name;
img = imread([folder name]);
S = zeros(size(img));
for ii = 1 : length(filenames)
    name = filenames(ii).name;
    currentImage = imread([folder name]);
    S = S + double(currentImage);
end
S = uint8(round(S / length(filenames)));
C++ code:
#include <opencv2/opencv.hpp>
#include <vector>
#include <iostream>

int main()
{
    double ticLoad = double(cv::getTickCount());
    std::string folder = "D:\\SO\\temp\\old_075_6\\*.bmp";
    std::vector<cv::String> filenames;
    cv::glob(folder, filenames);
    int rows, cols;
    {
        // Just load the first image to get rows and cols
        cv::Mat3b img = cv::imread(filenames[0]);
        rows = img.rows;
        cols = img.cols;
    }
    {
        // Version from the question: convert each image, then add
        double tic = double(cv::getTickCount());
        cv::Mat3d S(rows, cols, 0.0);
        for (const auto& name : filenames)
        {
            cv::Mat currentImage = cv::imread(name);
            currentImage.convertTo(currentImage, CV_64F);
            S += currentImage;
        }
        S = S * double(1.0 / filenames.size());
        cv::Mat3b avg;
        S.convertTo(avg, CV_8U);
        double toc = double(cv::getTickCount());
        double timeLoad = (toc - ticLoad) / cv::getTickFrequency();
        double time = (toc - tic) / cv::getTickFrequency();
        std::cout << "#Seven Wang: " << time << " s (" << timeLoad << " s)" << std::endl;
    }
    {
        // Slightly optimized version: cv::add converts while accumulating
        double tic = double(cv::getTickCount());
        cv::Mat3d S(rows, cols, 0.0);
        cv::Mat3b currentImage;
        for (const auto& name : filenames)
        {
            currentImage = cv::imread(name);
            cv::add(S, currentImage, S, cv::noArray(), CV_64F);
        }
        S /= filenames.size();
        cv::Mat3b avg;
        S.convertTo(avg, CV_8U);
        double toc = double(cv::getTickCount());
        double timeLoad = (toc - ticLoad) / cv::getTickFrequency();
        double time = (toc - tic) / cv::getTickFrequency();
        std::cout << "#Miki: " << time << " s (" << timeLoad << " s)" << std::endl;
    }
    return 0;
}
One point that drew my attention is the type "CV_32FC3". Do you specifically want a 32-bit float matrix, and are you sure Matlab reads the pixel values the same way?
Because you have that extra step
tempIM.convertTo(tempIM, CV_32FC3);
in your C++ code, whereas Matlab operates on the image as soon as it is read, without any conversion, which might be slowing down your C++ code. Furthermore, if Matlab is not reading the image as float values, that might contribute to the speed difference, since floating-point arithmetic is harder for the CPU to handle than integer arithmetic.