A better way to do matrix-log operations in Eigen? - c++

I am playing around with Eigen, doing some calculations with matrices and logs/exps, but I find the expressions I end up with a bit clumsy (and possibly slower?). Is there a better way to write calculations like this?
MatrixXd m = MatrixXd::Random(3,3);
m = m * (m.array().log()).matrix();
That is, without having to convert to arrays and then back to a matrix?

If you are mixing array and matrix operations, you can't really avoid the conversions, except for the few functions that have a cwise variant working directly on matrices (e.g., cwiseSqrt(), cwiseAbs()).
However, neither .array() nor .matrix() will have an impact on runtime when compiled with optimization (on any reasonable compiler).
If you consider that more readable, you can work with unaryExpr().
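For example, a minimal sketch (taking absolute values first, as the toy program in the next answer also does, so the log is well defined):

MatrixXd m = MatrixXd::Random(3, 3).cwiseAbs();
// unaryExpr applies a functor coefficient-wise while staying in the
// matrix world, so no .array()/.matrix() round-trip is needed:
m = m * m.unaryExpr([](double x) { return std::log(x); });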

I agree fully with chtz's answer, and reiterate that there is no runtime cost to the "casts." You can confirm using the following toy program:
#include "Eigen/Core"
#include <iostream>
#include <chrono>
using namespace Eigen;
int main()
{
typedef MatrixXd matType;
//typedef MatrixXf matType;
volatile int vN = 1024 * 4;
int N = vN;
auto startAlloc = std::chrono::system_clock::now();
matType m = matType::Random(N, N).array().abs();
matType r1 = matType::Zero(N, N);
matType r2 = matType::Zero(N, N);
auto finishAlloc = std::chrono::system_clock::now();
r1 = m * (m.array().log()).matrix();
auto finishLog = std::chrono::system_clock::now();
r2 = m * m.unaryExpr<float(*)(float)>(&std::log);
auto finishUnary = std::chrono::system_clock::now();
std::cout << (r1 - r2).array().abs().maxCoeff() << '\n';
std::cout << "Allocation\t" << std::chrono::duration<double>(finishAlloc - startAlloc).count() << '\n';
std::cout << "Log\t\t" << std::chrono::duration<double>(finishLog - finishAlloc).count() << '\n';
std::cout << "unaryExpr\t" << std::chrono::duration<double>(finishUnary - finishLog).count() << '\n';
return 0;
}
On my computer, there is a slight advantage (~4%) to the first form, which probably has to do with how the memory is loaded (unchecked). Beyond that, the reason for "casting" the type is to remove ambiguity. For a clear example, consider operator*: in the matrix form it means matrix multiplication, whereas in the array form it means coefficient-wise multiplication. For exp and log, the ambiguity is between the element-wise functions and the matrix exponential and matrix logarithm, respectively. Presumably you want the element-wise exp and log, and therefore the cast is necessary.
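To see that ambiguity concretely, a small illustrative sketch:

MatrixXd a = MatrixXd::Random(3, 3), b = MatrixXd::Random(3, 3);
MatrixXd mm = a * b;                              // matrix multiplication
MatrixXd cw = (a.array() * b.array()).matrix();   // coefficient-wise product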

Related

Why/how are division and multiplication equally fast here?

I'm trying to make a simple benchmarking algorithm to compare different operations. Before moving on to the actual functions, I wanted to check a trivial case with a well-documented outcome: multiplication vs. division.
According to the literature I have read, division should lose by a fair margin. When I compiled and ran the algorithm, the times were just about 0. I added an accumulator that is printed, to make sure the operations are actually carried out, and tried again. Then I changed the loop, the numbers, the shuffling, and more, all to prevent anything that could cause "divide" to do anything but floating-point division. To no avail: the times are still basically equal.
At this point I don't see where it could weasel its way out of the floating-point divide, so I give up. It wins. But I am really curious why the times are so close, what caveats/bugs I missed, and how to fix them.
(I know filling the vector with random data and then shuffling is redundant, but I wanted to make sure the data was accessed and not just initialized before the loop.)
("String compares are evil", I am aware. If they are the cause of the equal times, I will gladly join the witch hunt. If not, please don't mention it.)
compile:
g++ -std=c++14 main.cc
tests:
./a.out multiply
2.42202e+09
1000000
t1 = 1.52422e+09 t2 = 1.52422e+09
difference = 0.218529
Average length of function : 2.18529e-07 seconds
./a.out divide
2.56147e+06
1000000
t1 = 1.52422e+09 t2 = 1.52422e+09
difference = 0.242061
Average length of function : 2.42061e-07 seconds
the code:
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <random>
#include <sys/time.h>
#include <sys/resource.h>

double get_time()
{
    struct timeval t;
    struct timezone tzp;
    gettimeofday(&t, &tzp);
    return t.tv_sec + t.tv_usec * 1e-6;
}

double multiply(double lhs, double rhs){
    return lhs * rhs;
}

double divide(double lhs, double rhs){
    return lhs / rhs;
}

int main(int argc, char *argv[]){
    if (argc == 1)
        return 0;
    double grounder = 0; // prevent optimizations
    std::default_random_engine generator;
    std::uniform_real_distribution<double> distribution(1.0, 100.0);
    size_t loop1 = argc > 2 ? std::stoi(argv[2]) : 1000;
    size_t loop2 = argc > 3 ? std::stoi(argv[3]) : 1000;
    std::vector<size_t> vecL1(loop1);
    std::generate(vecL1.begin(), vecL1.end(), [generator, distribution] () mutable { return distribution(generator); });
    std::vector<size_t> vecL2(loop2);
    std::generate(vecL2.begin(), vecL2.end(), [generator, distribution] () mutable { return distribution(generator); });
    double (*fp)(double, double);
    std::string function(argv[1]);
    if (function == "multiply")
        fp = (*multiply);
    if (function == "divide")
        fp = (*divide);
    std::random_shuffle(vecL1.begin(), vecL1.end());
    std::random_shuffle(vecL2.begin(), vecL2.end());
    double t1 = get_time();
    for (auto outer = vecL1.begin(); outer != vecL1.end(); outer++)
        for (auto inner = vecL2.begin(); inner != vecL2.end(); inner++)
            grounder += (*fp)(*inner, *outer);
    double t2 = get_time();
    std::cout << grounder << '\n';
    std::cout << (loop1 * loop2) << '\n';
    std::cout << "t1 = " << t1 << "\tt2 = " << t2
              << "\ndifference = " << (t2 - t1) << '\n';
    std::cout << "Average length of function : " << (t2 - t1) * 1 / (loop1 * loop2) << " seconds \n";
    return 0;
}
You aren't just measuring the speed of multiplication/division. If you put your code into https://godbolt.org/ you can see the generated assembly.
You are measuring the speed of calling a function and then doing a multiply/divide inside that function. The time taken by the single multiply/divide instruction is tiny compared to the cost of the function call, so it gets lost in the noise. If you move the loop to inside your function, you'll probably see more of a difference. Note that with the loop inside the function, your compiler may decide to vectorise the code; that will still show whether there is a difference between multiply and divide, but it won't be measuring the cost of the single mul/div instruction.
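A minimal sketch of that restructuring (hypothetical names, not the original code): the loop now lives inside the measured function, so the indirect-call overhead is paid once rather than per element.

#include <vector>

double multiply_all(const std::vector<double>& xs, const std::vector<double>& ys)
{
    double acc = 0.0;
    for (double x : xs)
        for (double y : ys)
            acc += x * y; // a single hot loop the compiler is free to vectorise
    return acc;
}
// divide_all would be identical with '/' in place of '*'.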

Should I prefer Rcpp::NumericVector over std::vector?

Is there any reason why I should prefer Rcpp::NumericVector over std::vector<double>?
For example, the two functions below
// [[Rcpp::export]]
Rcpp::NumericVector foo(const Rcpp::NumericVector& x) {
    Rcpp::NumericVector tmp(x.length());
    for (int i = 0; i < x.length(); i++)
        tmp[i] = x[i] + 1.0;
    return tmp;
}

// [[Rcpp::export]]
std::vector<double> bar(const std::vector<double>& x) {
    std::vector<double> tmp(x.size());
    for (int i = 0; i < x.size(); i++)
        tmp[i] = x[i] + 1.0;
    return tmp;
}
are equivalent in their behavior and benchmarked performance. I understand that Rcpp offers sugar and vectorized operations, but if it is only about taking R's vector as input and returning a vector as output, would there be any difference in which one of those I use? Can using std::vector<double> lead to any possible problems when interacting with R?
are equivalent in their behavior and benchmarked performance.
I doubt that the benchmarks are accurate, because going from a SEXP to a std::vector<double> requires a deep copy from one data structure to another. (And as I was typing this, @DirkEddelbuettel ran a microbenchmark.)
The markup of the Rcpp object (e.g. const Rcpp::NumericVector& x) is just visual sugar. By default, the object given is a pointer, and as such it can easily have a ripple modification effect (see below). Thus, there is no true equivalent of const std::vector<double>& x that effectively "locks" the object and passes a reference.
Can using std::vector<double> lead to any possible problems when interacting with R?
In short, no. The only penalty you pay is the copy between objects.
The gain from that copy is that modifying a value in one std::vector<double> that was assigned from another will not cause a domino update: each std::vector<T> is an independent copy. With Rcpp vectors, whose assignment shares the underlying memory, the following can happen (and couldn't with std::vector):
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
void test_copy(){
    NumericVector A = NumericVector::create(1, 2, 3);
    NumericVector B = A; // B shares A's underlying memory
    Rcout << "Before: " << std::endl << "A: " << A << std::endl << "B: " << B << std::endl;
    A[1] = 5; // 2 -> 5, visible through B as well
    Rcout << "After: " << std::endl << "A: " << A << std::endl << "B: " << B << std::endl;
}
Gives:
test_copy()
# Before:
# A: 1 2 3
# B: 1 2 3
# After:
# A: 1 5 3
# B: 1 5 3
Is there any reason why I should prefer Rcpp::NumericVector over std::vector<double>?
There are a few reasons:
As hinted previously, using Rcpp::NumericVector avoids a deep copy to and from the C++ std::vector<T>.
You gain access to the sugar functions.
The ability to "mark up" the Rcpp object in C++ (e.g. adding attributes via .attr()); see the sketch below.
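A minimal sketch of the third point (hypothetical function name):

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector with_names() {
    NumericVector v = NumericVector::create(1.0, 2.0, 3.0);
    v.attr("names") = CharacterVector::create("a", "b", "c"); // set an R attribute from C++
    return v; // arrives in R as a named numeric vector
}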
"If unsure, just time it."
All it takes is to add these few lines to the file you already had:
/*** R
library(microbenchmark)
x <- 1.0* 1:1e7 # make sure it is numeric
microbenchmark(foo(x), bar(x), times=100L)
*/
Then just calling sourceCpp("...yourfile...") generates the following result (plus warnings on signed/unsigned comparisons):
R> library(microbenchmark)
R> x <- 1.0* 1:1e7 # make sure it is numeric
R> microbenchmark(foo(x), bar(x), times=100L)
Unit: milliseconds
expr min lq mean median uq max neval cld
foo(x) 31.6496 31.7396 32.3967 31.7806 31.9186 54.3499 100 a
bar(x) 50.9229 51.0602 53.5471 51.1811 51.5200 147.4450 100 b
R>
Your bar() solution needs to make a copy to create an R object in the R memory pool; foo() does not. That matters for large vectors that you run over many times. Here we see a ratio of about 1.8.
In practice, it may not matter much whether you prefer one coding style over the other.

Eigen efficient inverse of symmetric positive definite matrix

In Eigen, if we have symmetric positive definite matrix A then we can calculate the inverse of A by
A.inverse();
or
A.llt().solve(I);
where I is an identity matrix of the same size as A. But is there a more efficient way to calculate the inverse of symmetric positive definite matrix?
For example if we write the Cholesky decomposition of A as A = LL^{T}, then L^{-T} L^{-1} is an inverse of A since A L^{-T} L^{-1} = LL^{T} L^{-T} L^{-1} = I (and where L^{-T} denotes the inverse of the transpose of L).
So we could obtain the Cholesky decomposition of A, calculate its inverse, and then obtain the cross-product of that inverse to find the inverse of A. But my instinct is that calculating these explicit steps will be slower than using A.llt().solve(I) as above.
And before anybody asks, I do indeed need an explicit inverse - it is a calculation for part of a Gibbs sampler.
With A.llt().solve(I), you assume A to be an SPD matrix and apply a Cholesky decomposition to solve the equation AX = I. The mathematical procedure is exactly the same as your explicit approach, so the performance should be the same if you do every step correctly.
On the other hand, with A.inverse(), you are doing a general matrix inversion, which uses an LU decomposition for large matrices. Its performance should therefore be lower than that of A.llt().solve(I).
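For reference, a sketch contrasting the explicit steps from the question with the one-liner (assuming an SPD Eigen::MatrixXd A; illustrative only, not benchmarked):

#include <Eigen/Dense>
// given SPD matrix A:
Eigen::LLT<Eigen::MatrixXd> llt(A);
Eigen::MatrixXd I = Eigen::MatrixXd::Identity(A.rows(), A.cols());
Eigen::MatrixXd L = llt.matrixL();                                // A = L L^T
Eigen::MatrixXd Linv = L.triangularView<Eigen::Lower>().solve(I); // L^{-1} by triangular solve
Eigen::MatrixXd Ainv_explicit = Linv.transpose() * Linv;          // L^{-T} L^{-1} = A^{-1}
Eigen::MatrixXd Ainv_direct = llt.solve(I);                       // same work done internally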
You should profile the code for your specific problem to get the best answer. I was benchmarking code while trying to evaluate the viability of both approaches using the googletest library and this repo:
#include <gtest/gtest.h>
#define private public
#define protected public
#include <kalman/Matrix.hpp>
#include <Eigen/Cholesky>
#include <chrono>
#include <iostream>

using namespace Kalman;
using namespace std::chrono;

typedef float T;
typedef high_resolution_clock Clock;

TEST(Cholesky, inverseTiming) {
    Matrix<T, Dynamic, Dynamic> L;
    Matrix<T, Dynamic, Dynamic> S;
    Matrix<T, Dynamic, Dynamic> Sinv_method1;
    Matrix<T, Dynamic, Dynamic> Sinv_method2;
    int Nmin = 2;
    int Nmax = 128;
    int N(Nmin);
    while (N <= Nmax) {
        L.resize(N, N);
        L.setRandom();
        S.resize(N, N);
        // create a random NxN SPD matrix
        S = L * L.transpose();
        std::cout << "\n";
        std::cout << "+++++++++++++++++++++++++ N = " << N << " +++++++++++++++++++++++++++++++++++++++" << std::endl;

        auto t1 = Clock::now();
        Sinv_method1.resize(N, N);
        Sinv_method1 = S.inverse();
        auto dt1 = Clock::now() - t1;
        std::cout << "Method 1 took " << duration_cast<microseconds>(dt1).count() << " usec" << std::endl;

        auto t2 = Clock::now();
        Sinv_method2.resize(N, N);
        Sinv_method2 = S.llt().solve(Matrix<T, Dynamic, Dynamic>::Identity(N, N));
        auto dt2 = Clock::now() - t2;
        std::cout << "Method 2 took " << duration_cast<microseconds>(dt2).count() << " usec" << std::endl;

        for(int i = 0; i < N; i++)
        {
            for(int j = 0; j < N; j++)
            {
                EXPECT_NEAR( Sinv_method1(i, j), Sinv_method2(i, j), 1e-3 );
            }
        }
        N *= 2;
        std::cout << "+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++" << std::endl;
        std::cout << "\n";
    }
}
What the above example showed me was that, for my problem size, the speedup from method 2 was negligible, whereas the loss of accuracy (taking the .inverse() call as the reference) was noticeable.

Nanoflann radius search

I have a question about the search_radius parameter in nanoflann's radiusSearch function. My code is this:
#include <iostream>
#include <vector>
#include <map>
#include "nanoflann.hpp"
#include "Eigen/Dense"
int main()
{
    Eigen::MatrixXf mat(7, 2);
    mat(0,0) =  0.0; mat(0,1) = 0.0;
    mat(1,0) =  0.1; mat(1,1) = 0.0;
    mat(2,0) = -0.1; mat(2,1) = 0.0;
    mat(3,0) =  0.2; mat(3,1) = 0.0;
    mat(4,0) = -0.2; mat(4,1) = 0.0;
    mat(5,0) =  0.5; mat(5,1) = 0.0;
    mat(6,0) = -0.5; mat(6,1) = 0.0;

    std::vector<float> query_pt(2);
    query_pt[0] = 0.0;
    query_pt[1] = 0.0;

    typedef nanoflann::KDTreeEigenMatrixAdaptor<Eigen::MatrixXf> KDTree;
    KDTree index(2, mat, 10);
    index.index->buildIndex();

    { // Find nearest neighbors in radius
        const float search_radius = 0.1f;
        std::vector<std::pair<size_t, float> > matches;
        nanoflann::SearchParams params;
        const size_t nMatches = index.index->radiusSearch(&query_pt[0], search_radius, matches, params);

        std::cout << "RadiusSearch(): radius = " << search_radius << " -> "
                  << nMatches << " matches" << std::endl;
        for(size_t i = 0; i < nMatches; i++)
            std::cout << "Idx[" << i << "] = " << matches[i].first
                      << " dist[" << i << "] = " << matches[i].second << std::endl;
        std::cout << std::endl;
    }
}
What I want is to get the points within a radius of 0.1, so I expected the first three elements in the matrix, but to my surprise it returned the first five. Checking the distances it returns, it seems that they are not the actual distances but the squared distances (right?), so I squared the radius to get what I expected, but unfortunately that returned only the first point.
So I increased the radius a little, from 0.1^2 = 0.01 to 0.02, and finally got the points I wanted.
Now, the question is: shouldn't the points lying on the perimeter of the neighborhood be included? Where can I change this condition in nanoflann?
The full definition of KDTreeEigenMatrixAdaptor starts like this:
template <class MatrixType, int DIM = -1,
          class Distance = nanoflann::metric_L2,
          typename IndexType = size_t>
struct KDTreeEigenMatrixAdaptor
{
    //...
So, yes: the default metric is the squared Euclidean distance, implemented by the L2_Adaptor struct and documented as follows:
Squared Euclidean distance functor (generic version, optimized for high-dimensionality data sets).
As for the second issue, there are two aspects. The first is that you should not rely on equality when it comes to floating-point numbers (obligatory reference: David Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic, ACM Computing Surveys, 1991).
The second is that, in principle, you are right. nanoflann is based on FLANN, in whose source code you can find the implementation of the CountRadiusResultSet class, used by the radiusSearch method. Its key method has the following implementation:
void addPoint(DistanceType dist, size_t index)
{
    if (dist < radius) {
        count++;
    }
}
By contrast, the common definition of this problem uses "less than or equal", as for example in the following reference (Matthew T. Dickerson, David Eppstein, Algorithms for Proximity Problems in Higher Dimensions, Computational Geometry, 1996):
Problem 1. (Fixed-Radius Near-Neighbors Search) Given a finite set S of n distinct points in Rd and a distance 𝛿. For each point p ∈ S report all pairs of points (p,q), q ∈ S such that the distance from p to q is less than or equal to 𝛿.
(the last emphasis by me)
Still, that's mathematics; in computer science, the pitfalls of floating-point arithmetic effectively rule out thinking about equality in such a strict manner.
It seems that your only choice here is to slightly increase the radius, because the use of the CountRadiusResultSet class is hard-coded in the radiusSearch method implementation inside FLANN.
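Given that, a minimal sketch of the workaround for the code above (the tolerance value is an arbitrary choice, not a nanoflann constant):

const float r = 0.1f;
// The metric is squared L2, so pass the squared radius, padded by a tiny
// relative tolerance so boundary points survive the strict '<' comparison:
const float search_radius = r * r * (1.0f + 1e-6f);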

gmp pow with two mpf_t

Is there an implementation in gmp that allows a power function with only mpf_t's as arguments? I want to do this:
mpf_t s;
mpf_init(s);
mpf_set_d(s, boost::lexical_cast<double>(sec));

mpf_t ten, mil;
mpf_init(ten);
mpf_init(mil);
mpf_set_d(ten, 10.0);
mpf_set_d(mil, 0.001);

mpf_div(s, s, ten);
mpf_pow_ui(s, ten, s); // <- this doesn't work: it needs an unsigned int as the third argument, but I need to pass an mpf_t
mpf_mul(s, s, mil);
I don't think so, at least not with the GNU Multiple Precision library alone. But you could use MPFR, which is based on GMP and supports an mpfr_pow(mpfr_t rop, mpfr_t op1, mpfr_t op2, mpfr_rnd_t rnd) function. See here.
If you decide to do that, this could also be helpful.
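A minimal sketch of the MPFR route (assuming MPFR is installed; link with -lmpfr -lgmp; the 128-bit precision here is an arbitrary choice):

#include <mpfr.h>

int main() {
    mpfr_t base, expo, result;
    mpfr_inits2(128, base, expo, result, (mpfr_ptr) 0); // 128-bit precision
    mpfr_set_d(base, 10.0, MPFR_RNDN);
    mpfr_set_d(expo, 0.5, MPFR_RNDN);
    mpfr_pow(result, base, expo, MPFR_RNDN);            // result = base^expo
    mpfr_printf("%.12Rf\n", result);                    // prints 3.162277660168
    mpfr_clears(base, expo, result, (mpfr_ptr) 0);
    return 0;
}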
There is one interesting workaround using the square root function mpf_sqrt. From math we know that x^y = Sqrt(x)^(y * 2), so we can multiply y by 2 many times and take the square root of x the same number of times.
Thus, by repeatedly multiplying y by 2, you can make it almost a whole integer. And as you know, there is mpf_pow_ui, which raises to a whole-integer power.
The following code does all this. Don't forget that b should be set to high precision, purely to allow it to be square-rooted so many times.
For simplicity I used mpf_class, the C++ interface to mpf.
I print to the console both the actual mpf result value and a reference value computed through std::pow from <cmath>.
Avoiding the high precision setting is a bit more difficult, but possible, e.g. through a Taylor series like
Sqrt(1 + x) = 1 + 1/2*x - 1/8*x^2 + 1/16*x^3 - 5/128*x^4 + ...
#include <cmath>
#include <iostream>
#include <iomanip>
#include <gmpxx.h>

int main() {
    mpf_class const b0 = 9.87654321, p0 = 1.23456789;
    mpf_class b = b0, p = p0;
    // High precision is needed to survive the repeated square roots:
    b.set_prec(1 << 7); p.set_prec(1 << 7);
    int const sqrt_cnt = 48;
    // b <- b^(1 / 2^48)
    for (int i = 0; i < sqrt_cnt; ++i)
        mpf_sqrt(b.get_mpf_t(), b.get_mpf_t());
    // p <- p * 2^48, now very close to a whole integer
    mpf_mul_2exp(p.get_mpf_t(), p.get_mpf_t(), sqrt_cnt);
    // b <- b^round(p), i.e. approximately b0^p0
    mpf_pow_ui(b.get_mpf_t(), b.get_mpf_t(), std::lround(p.get_d()));
    std::cout << std::fixed << std::setprecision(12) << "Actual "
        << b.get_d() << ", Reference " << std::pow(b0.get_d(), p0.get_d())
        << std::endl;
}
Output:
Actual 16.900803674719, Reference 16.900803674719