Eigen: Slow access to columns of Matrix 4 - c++

I am using Eigen for operations similar to a Cholesky update, which implies a lot of AXPY operations (adding a vector multiplied by a scalar) on the columns of a fixed-size matrix, typically a Matrix4d. In brief, accessing the columns of a 4x4 matrix is 3 times more expensive than accessing a Vector 4.
Typically, the code below:
for(int i=0;i<4;++i ) L.col(0) += x*y[i];
is 3 times less efficient than the code below:
for(int i=0;i<4;++i ) l4 += x*y[i];
where L is typically a matrix of size 4, x, y and l4 are vectors of size 4.
Moreover, the time spent in the first line of code does not depend on the matrix storage order (either RowMajor or ColMajor).
On an Intel i7 (2.5GHz), it takes about 0.007us for vector operations and 0.02us for matrix operations (timings are obtained by repeating the same operation 100000 times). My application would need thousands of such operations, in a total time hopefully far below one millisecond.
Question: Am I doing something wrong when accessing the columns of my 4x4 matrix? Is there anything I can do to make the first line of code more efficient?
Full code used for timings is below:
#include <iostream>
#include <Eigen/Core>
#include <vector>
#include <sys/time.h>
typedef Eigen::Matrix<double,4,1,Eigen::ColMajor> Vector4;
//typedef Eigen::Matrix<double,4,4,Eigen::RowMajor,4,4> Matrix4;
typedef Eigen::Matrix<double,4,4,Eigen::ColMajor,4,4> Matrix4;
inline double operator- ( const struct timeval & t1,const struct timeval & t0)
{
/* TODO: double check the double conversion from long (on 64x). */
return double(t1.tv_sec - t0.tv_sec)+1e-6*double(t1.tv_usec - t0.tv_usec);
}
void sumCols( Matrix4 & L,
Vector4 & x4,
Vector4 & y)
{
for(int i=0;i<4;++i )
{
L.col(0) += x4*y[i];
}
}
void sumVec( Vector4 & L,
Vector4 & x4,
Vector4 & y)
{
for(int i=0;i<4;++i )
{
//L.tail(4-i) += x4.tail(4-i)*y[i];
L += x4 *y[i];
}
}
int main()
{
using namespace Eigen;
const int NBT = 1000000;
struct timeval t0,t1;
std::vector< Vector4> x4s(NBT);
std::vector< Vector4> y4s(NBT);
std::vector< Vector4> z4s(NBT);
std::vector< Matrix4> L4s(NBT);
for(int i=0;i<NBT;++i)
{
x4s[i] = Vector4::Random();
y4s[i] = Vector4::Random();
L4s[i] = Matrix4::Random();
}
int sample = int(z4s[55][2]/10*NBT);
std::cout << "*** SAMPLE = " << sample << std::endl;
gettimeofday(&t0,NULL);
for(int i=0;i<NBT;++i)
{
sumCols(L4s[i], x4s[i], y4s[i]);
}
gettimeofday(&t1,NULL);
std::cout << (t1-t0) << std::endl;
std::cout << "\t\t\t\t\t\t\tForce check" << L4s[sample](1,0) << std::endl;
gettimeofday(&t0,NULL);
for(int i=0;i<NBT;++i)
{
sumVec(z4s[i], x4s[i], y4s[i]);
}
gettimeofday(&t1,NULL);
std::cout << (t1-t0) << std::endl;
std::cout << "\t\t\t\t\t\t\tForce check" << z4s[sample][2] << std::endl;
return -1;
}

As I said in a comment, the generated assembly is exactly the same for both functions.
The problem is that your benchmark is biased in the sense that L4s is 4 times bigger than z4s, and you thus get more cache misses in the matrix case than in the vector case.
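To put rough numbers on the size difference, here is a minimal sketch. It assumes that a fixed-size Eigen matrix stores exactly rows*cols scalars with no extra header, so std::array is used as a stand-in and Eigen itself is not needed. Vec4Storage, Mat4Storage and working_set_bytes are hypothetical names introduced only for this illustration.

```cpp
#include <array>
#include <cstddef>

// Stand-ins for the payload of the fixed-size types in the benchmark
// (assumption: same size as the corresponding Eigen objects).
using Vec4Storage = std::array<double, 4>;   // like Vector4: 4 doubles
using Mat4Storage = std::array<double, 16>;  // like Matrix4: 16 doubles

// Bytes spanned when a benchmark walks `count` consecutive objects of type T.
template <class T>
constexpr std::size_t working_set_bytes(std::size_t count) {
    return count * sizeof(T);
}
```

With NBT = 1000000 as in the listing, working_set_bytes<Mat4Storage>(1000000) is 128000000 bytes while working_set_bytes<Vec4Storage>(1000000) is 32000000 bytes, so the matrix loop strides through four times as much memory even though it only updates one column per matrix. A fairer comparison would keep both working sets the same size, for example by timing repeated updates on a small set of objects that fits in cache.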

Related

How to determine the original matrix size from the result generated from `fftw_dft_r2c`?

For fftw3, if you perform a real 2D FFT on a matrix A of size [m x n], you get a complex matrix B of size [m x (n/2 + 1)]. If we only have B, how can we determine A's shape, i.e. the rows and cols of A?
For 1D FFT problems, I learned that performing the FFT on a vector of even size results in a complex vector that ends with a real number, i.e. vector_fft[-1].imag() == 0. This can be the key to determining the original vector size.
However, for the 2D FFT the problem seems more complicated: there is no such evident feature in the 2D FFT-ed array. I read the fftw3 manual and am still confused about how fftw compresses the conjugate-symmetric data in the result matrix.
#define EIGEN_DEFAULT_TO_ROW_MAJOR // must precede the Eigen include to take effect
#include <iostream>
#include <Eigen/Eigen>
#include <fftw3.h>
#include <complex>
using namespace Eigen;
using Cmpl = std::complex<double>;
using Matd = Matrix<double, Dynamic, Dynamic, RowMajor>;
using Vecd = Matrix<double, 1, Dynamic, RowMajor>;
using Veccd = Matrix< Cmpl, 1, Dynamic, RowMajor>;
using Matcd = Matrix< Cmpl, Dynamic, Dynamic, RowMajor>;
Veccd fft_1d_r2c(const Vecd& in) {
Veccd out(in.size() / 2 + 1);
auto plan = fftw_plan_dft_r2c_1d(in.size(),
(double*)in.data(), (fftw_complex*)out.data(), FFTW_ESTIMATE);
fftw_execute(plan);
fftw_destroy_plan(plan);
return out;
}
Veccd fft_1d_c2c(const Veccd& in) {
Veccd out(in.size());
auto plan = fftw_plan_dft_1d(in.size(),
(fftw_complex*)in.data(), (fftw_complex*)out.data(),
FFTW_FORWARD, FFTW_ESTIMATE);
fftw_execute(plan);
fftw_destroy_plan(plan);
return out;
}
Matcd fft_2d_r2c(const Matd& in) {
Matcd out(in.rows(), in.cols() / 2 + 1);
auto plan = fftw_plan_dft_r2c_2d(in.rows(), in.cols(),
(double*)in.data(), (fftw_complex*)out.data(), FFTW_ESTIMATE);
fftw_execute(plan);
fftw_destroy_plan(plan);
return out;
}
Matcd fft_2d_c2c(const Matcd& in) {
Matcd out(in.rows(), in.cols());
auto plan = fftw_plan_dft_2d(in.rows(), in.cols(),
(fftw_complex*)in.data(), (fftw_complex*)out.data(),
FFTW_FORWARD, FFTW_ESTIMATE);
fftw_execute(plan);
fftw_destroy_plan(plan);
return out;
}
void test_1d(const int size) {
Vecd vec{size};
vec.setRandom();
Veccd vec_fft = fft_1d_c2c(vec.cast<Cmpl>());
Veccd vec_rfft = fft_1d_r2c(vec);
std::cout << "Original Vector: \n" << vec << std::endl << std::endl
<< "fftw_r2c of vec: \n" << vec_rfft << std::endl << std::endl;
}
void test_2d(const int rows, const int cols) {
Matd mat{rows, cols};
mat.setRandom();
Matcd mat_rfft = fft_2d_r2c(mat);
Matcd mat_fft = fft_2d_c2c(mat.cast<Cmpl>());
std::cout << "Original Matrix: \n" << mat << std::endl << std::endl
<< "fftw_r2c of mat: \n" << mat_rfft << std::endl << std::endl;
std::cout << "fftw_c2c of mat: \n" << mat_fft << std::endl;
}
int main(int argc, char* argv[]) {
std::cout << "For 1D fft problems" << std::endl;
test_1d(9);
test_1d(10);
std::cout << "For 2D fft problems" << std::endl;
test_2d(6, 5);
test_2d(6, 4);
return 0;
}
Given the matrix returned by fftw_r2c, how can I determine the shape of the original matrix?
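As a small illustration of why the shape of B alone is ambiguous, the sketch below only reproduces the size formula quoted in the question (n/2 + 1 with integer division); it does not call FFTW. The names r2c_cols, WidthCandidates and candidate_widths are hypothetical helpers introduced here.

```cpp
#include <cstddef>

// Number of complex columns in the r2c output when the last dimension of
// the real input has n samples (formula from the question).
constexpr std::size_t r2c_cols(std::size_t n) { return n / 2 + 1; }

// The two real widths that map to the same number of complex columns.
struct WidthCandidates {
    std::size_t even;  // n = 2 * (cols - 1)
    std::size_t odd;   // n = 2 * (cols - 1) + 1
};

// Precondition: complex_cols >= 1.
inline WidthCandidates candidate_widths(std::size_t complex_cols) {
    return WidthCandidates{2 * (complex_cols - 1), 2 * (complex_cols - 1) + 1};
}
```

The row count m carries over unchanged, but n = 4 and n = 5 both give 3 complex columns, so the dimensions of B narrow n down to two candidates and some additional information (the stored size, or a property of the data) is needed to choose between them.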

Building a large boost unordered_map with cpp_int

I am writing some code in C++ for a class assignment that requires working with a multiprecision library such as boost. Basically, I need to build a hash table with some large integers and then look up a certain value in that table.
When I use the h, g, p values that are commented out, the code runs fine and very quickly. Once I switch to those that are not commented out, it throws a memory exception at the line: boost::unordered::unordered_map<cpp_int, cpp_int, hash_str<cpp_int>>::iterator got = mp.find( lkp );
I am just starting out with C++ and am pretty sure that something is way off, because this should run rather quickly, even with large numbers.
#include <boost/unordered_map.hpp>
#include <boost/multiprecision/cpp_int.hpp>
#include <boost/math/special_functions/pow.hpp>
using namespace std;
using namespace boost::multiprecision;
template <typename T>
struct hash_str
{
size_t operator()( const T& t ) const
{
return std::hash<std::string>()
( t.str() );
}
};
int main()
{
boost::unordered_map<cpp_int, cpp_int, hash_str<cpp_int>> mp;
//boost::unordered_map<hash_str<cpp_int>, cpp_int, hash_str<cpp_int>> mp;
cpp_int k;
cpp_int h( "3239475104050450443565264378728065788649097520952449527834792452971981976143292558073856937958553180532878928001494706097394108577585732452307673444020333" );
cpp_int g( "11717829880366207009516117596335367088558084999998952205599979459063929499736583746670572176471460312928594829675428279466566527115212748467589894601965568" );
//cpp_int g = 1010343267;
//cpp_int h = 857348958;
//cpp_int p = 1073676287;
cpp_int p( "13407807929942597099574024998205846127479365820592393377723561443721764030073546976801874298166903427690031858186486050853753882811946569946433649006084171" );
int b = pow( 2, 20 );
cpp_int denom;
cpp_int inv = powm( g, p - 2, p );
//building a hash table of all values h/g^x1
for ( cpp_int x = 1; x < b; ++x )
{
// go through all 2^20 values up to b, calculate the function h/g^x1,
// then hash it to put into table
denom = powm( inv, x, p );
k = ( h *denom ) % p;
mp.insert( std::make_pair( k, x ) );
}
cpp_int lkp;
for ( int v = 1; v < b; ++v )
{
//cpp_int gb = pow(g, b);
lkp = powm( g, v*b, p );
//looking for a match for g^b^x0 in map mp; when found we need to find x
//which is x1 and then calc 'x'
boost::unordered::unordered_map<cpp_int, cpp_int, hash_str<cpp_int>>::iterator got = mp.find( lkp );
// Check if iterator points to end of map or if we found our value
if ( got != mp.end() )
{
std::cout << "Element Found - ";
//std::cout << got->first << "::" << got->second << std::endl;
}
/*else
{
std::cout << "Element Not Found" << std::endl;
}*/
}
return 0;
}
Just in case, here is the exception I get:
Unhandled exception at 0x768F2F71 in MiM.exe: Microsoft C++ exception: boost::exception_detail::clone_impl > at memory location 0x0109EF5C.
The hash function is pretty atrocious because it allocates a temporary string only to hash it. The string will be roughly bits*log(2)/log(10) bytes long (one character per decimal digit).
The point of the hash is that it's a relatively fast way to compare numbers. With a hash that expensive, you're better off with a regular tree container (std::map<> e.g.).
I haven't checked your formulas (especially around h/g^x1, because I'm not even sure that x represents x1). Outside of that issue,
I think there is a correctness issue with v * b overflowing the capacity of int, at least on a compiler where int is 32 bits.
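The size of the problem can be checked with a few lines of arithmetic. This is only a sketch under the assumption that int is 32 bits wide; max_exponent and overflows_int32 are hypothetical helper names, and the computation is done in 64-bit integers so that it cannot itself overflow for the values used here.

```cpp
#include <cstdint>
#include <limits>

// Largest exponent requested by the lookup loop: v * b with v = b - 1.
constexpr std::int64_t max_exponent(std::int64_t b) {
    return (b - 1) * b;
}

// True when that product cannot be represented in a 32-bit signed int.
constexpr bool overflows_int32(std::int64_t b) {
    return max_exponent(b) > std::numeric_limits<std::int32_t>::max();
}
```

For b = 2^20 the largest product is 2^40 - 2^20 = 1099510579200, far beyond the 32-bit limit of 2147483647; the first overflow already happens at v = 2048, because 2048 * 2^20 = 2^31. Making v a cpp_int (or at least a 64-bit integer) avoids this.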
I've cleaned up a little bit and it runs
#include <boost/functional/hash.hpp> // boost::hash_range
#include <boost/math/special_functions/pow.hpp>
#include <boost/multiprecision/cpp_int.hpp>
#include <boost/unordered_map.hpp>
#include <chrono>
#include <iostream>
#include <string>
namespace bmp = boost::multiprecision;
using namespace std::chrono_literals;
using Clock = std::chrono::high_resolution_clock;
template <typename T> struct hash_str {
size_t operator()(const T &t) const { return std::hash<std::string>()(t.str()); }
};
template <typename T> struct hash_bin {
size_t operator()(const T &t) const {
return boost::hash_range(t.backend().limbs(), t.backend().limbs()+t.backend().size());
}
};
int main() {
using bmp::cpp_int;
boost::unordered_map<cpp_int, cpp_int, hash_bin<cpp_int> > mp;
#if 1
cpp_int const h("32394751040504504435652643787280657886490975209524495278347924529719819761432925580738569379585531805328"
"78928001494706097394108577585732452307673444020333");
cpp_int const g("11717829880366207009516117596335367088558084999998952205599979459063929499736583746670572176471460312928"
"594829675428279466566527115212748467589894601965568");
cpp_int const p("13407807929942597099574024998205846127479365820592393377723561443721764030073546976801874298166903427690"
"031858186486050853753882811946569946433649006084171");
#else
cpp_int const g = 1010343267;
cpp_int const h = 857348958;
cpp_int const p = 1073676287;
#endif
int constexpr b = 1 << 20;
cpp_int const inv = powm(g, p - 2, p);
{
auto s = Clock::now();
// building a hash table of all values h/g^x1
for (cpp_int x = 1; x < b; ++x) {
// go through [1, b), calculate the function h/g^x1,
// then hash it to put into table
cpp_int denom = powm(inv, x, p);
cpp_int k = (h * denom) % p;
mp.emplace(std::move(k), x);
}
std::cout << "Built map in " << (Clock::now() - s)/1.0s << "s\n";
}
{
auto s = Clock::now();
for (cpp_int v = 1; v < b; ++v) {
//std::cout << "v=" << v << " b=" << b << "\n";
// cpp_int gb = pow(g, b);
cpp_int const lkp = powm(g, v * b, p);
// looking for a match for g^b^x0 in map mp; when found we need to find x
// which is x1 and then calc 'x'
auto got = mp.find(lkp);
// Check if iterator points to end of map or if we found our value
if (got != mp.end()) {
std::cout << "Element Found - ";
//std::cout << got->first << " :: " << got->second << "\n";
}
}
std::cout << "Completed queries in " << (Clock::now() - s)/1.0s << "s\n";
}
}
It runs in 1m4s for me.
Built map in 24.3809s
Element Found - Completed queries in 39.2463s
...
Using hash_str instead of hash_bin takes 1m13s:
Built map in 30.3923s
Element Found - Completed queries in 42.488s

Finding the median value of a vector using C++

I'm a programming student, and for a project I'm working on, one of the things I have to do is compute the median value of a vector of int values, and this must be done by passing the vector to a function. The vector is initially generated randomly using the C++ random generator mt19937, which I have already written in my code. I'm to do this using the sort function and vector member functions such as .begin(), .end(), and .size().
I'm supposed to make sure I find the median value of the vector and then output it.
I'm stuck; below I have included my attempt. So where am I going wrong? I would appreciate it if you would be willing to give me some pointers or resources to get going in the right direction.
Code:
#include<iostream>
#include<vector>
#include<cstdlib>
#include<ctime>
#include<random>
using namespace std;
double find_median(vector<double>);
double find_median(vector<double> len)
{
{
int i;
double temp;
int n=len.size();
int mid;
double median;
bool swap;
do
{
swap = false;
for (i = 0; i< len.size()-1; i++)
{
if (len[i] > len[i + 1])
{
temp = len[i];
len[i] = len[i + 1];
len[i + 1] = temp;
swap = true;
}
}
}
while (swap);
for (i=0; i<len.size(); i++)
{
if (len[i]>len[i+1])
{
temp=len[i];
len[i]=len[i+1];
len[i+1]=temp;
}
mid=len.size()/2;
if (mid%2==0)
{
median= len[i]+len[i+1];
}
else
{
median= (len[i]+0.5);
}
}
return median;
}
}
int main()
{
int n,i;
cout<<"Input the vector size: "<<endl;
cin>>n;
vector <double> foo(n);
mt19937 rand_generator;
rand_generator.seed(time(0));
uniform_real_distribution<double> rand_distribution(0,0.8);
cout<<"original vector: "<<" ";
for (i=0; i<n; i++)
{
double rand_num=rand_distribution(rand_generator);
foo[i]=rand_num;
cout<<foo[i]<<" ";
}
double median;
median=find_median(foo);
cout<<endl;
cout<<"The median of the vector is: "<<" ";
cout<<median<<endl;
}
The median is given by
const auto median_it = len.begin() + len.size() / 2;
std::nth_element(len.begin(), median_it , len.end());
auto median = *median_it;
For even numbers (size of vector) you need to be a bit more precise. E.g., you can use
assert(!len.empty());
if (len.size() % 2 == 0) {
const auto median_it1 = len.begin() + len.size() / 2 - 1;
const auto median_it2 = len.begin() + len.size() / 2;
std::nth_element(len.begin(), median_it1 , len.end());
const auto e1 = *median_it1;
std::nth_element(len.begin(), median_it2 , len.end());
const auto e2 = *median_it2;
return (e1 + e2) / 2;
} else {
const auto median_it = len.begin() + len.size() / 2;
std::nth_element(len.begin(), median_it , len.end());
return *median_it;
}
There are of course many different ways to get element e1; we could also use max or whatever we want. But this step is important because nth_element only places the nth element correctly: the remaining elements are partitioned before or after it, depending on whether they are smaller or larger, and each of those two ranges is left unsorted.
This code is guaranteed to have linear complexity on average, i.e., O(N), therefore it is asymptotically better than sort, which is O(N log N).
Regarding your code:
for (i=0; i<len.size(); i++){
if (len[i]>len[i+1])
This will not work, as you access len[len.size()] in the last iteration which does not exist.
std::sort(len.begin(), len.end());
double median = len[len.size() / 2];
will do it. You might need to take the average of the middle two elements if size() is even, depending on your requirements:
0.5 * (len[len.size() / 2 - 1] + len[len.size() / 2]);
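Put together, a sort-based version could look like the sketch below. The name find_median_sorted is hypothetical, and throwing on an empty vector is just one possible policy that the fragments above do not specify.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Median via a full sort. The vector is taken by value, so the caller's
// data is left untouched.
double find_median_sorted(std::vector<double> len)
{
    if (len.empty())
        throw std::invalid_argument("median of an empty vector is undefined");
    std::sort(len.begin(), len.end());
    const std::size_t mid = len.size() / 2;
    if (len.size() % 2 == 0)
        return 0.5 * (len[mid - 1] + len[mid]);  // average the two middle values
    return len[mid];                             // single middle value
}
```

For example, find_median_sorted({0, 1, 1, 2}) averages the two middle elements and find_median_sorted({3, 1, 2}) returns the single middle element after sorting.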
Instead of trying to do everything at once, you should start with simple test cases and work upwards:
#include<vector>
double find_median(std::vector<double> len);
// Return the number of failures - shell interprets 0 as 'success',
// which suits us perfectly.
int main()
{
return find_median({0, 1, 1, 2}) != 1;
}
This already fails with your code (even after fixing i to be an unsigned type), so you could start debugging (even 'dry' debugging, where you trace the code through on paper; that's probably enough here).
I do note that with a smaller test case, such as {0, 1, 2}, I get a crash rather than merely failing the test, so there's something that really needs to be fixed.
Let's replace the implementation with one based on overseas's answer:
#include <algorithm>
#include <limits>
#include <vector>
double find_median(std::vector<double> len)
{
if (len.size() < 1)
return std::numeric_limits<double>::signaling_NaN();
const auto alpha = len.begin();
const auto omega = len.end();
// Find the two middle positions (they will be the same if size is odd)
const auto i1 = alpha + (len.size()-1) / 2;
const auto i2 = alpha + len.size() / 2;
// Partial sort to place the correct elements at those indexes (it's okay to modify the vector,
// as we've been given a copy; otherwise, we could use std::partial_sort_copy to populate a
// temporary vector).
std::nth_element(alpha, i1, omega);
std::nth_element(i1, i2, omega);
return 0.5 * (*i1 + *i2);
}
Now, our test passes. We can write a helper method to allow us to create more tests:
#include <cmath>
#include <iostream>
bool test_median(const std::vector<double>& v, double expected)
{
auto actual = find_median(v);
// std::abs from <cmath>: an unqualified abs may pick the integer overload
if (std::abs(expected - actual) > 0.01) {
std::cerr << actual << " - expected " << expected << std::endl;
return true;
} else {
std::cout << actual << std::endl;
return false;
}
}
int main()
{
return test_median({0, 1, 1, 2}, 1)
+ test_median({5}, 5)
+ test_median({5, 5, 5, 0, 0, 0, 1, 2}, 1.5);
}
Once you have the simple test cases working, you can manage more complex ones. Only then is it time to create a large array of random values to see how well it scales:
#include <ctime>
#include <functional>
#include <iterator>
#include <random>
#include <string>
int main(int argc, char **argv)
{
std::vector<double> foo;
const int n = argc > 1 ? std::stoi(argv[1]) : 10;
foo.reserve(n);
std::mt19937 rand_generator(std::time(0));
std::uniform_real_distribution<double> rand_distribution(0,0.8);
std::generate_n(std::back_inserter(foo), n, std::bind(rand_distribution, rand_generator));
std::cout << "Vector:";
for (auto v: foo)
std::cout << ' ' << v;
std::cout << "\nMedian = " << find_median(foo) << std::endl;
}
(I've taken the number of elements as a command-line argument; that's more convenient in my build than reading it from cin). Notice that instead of allocating n doubles in the vector, we simply reserve capacity for them, but don't create any until needed.
For fun and kicks, we can now make find_median() generic. I'll leave that as an exercise; I suggest you start with:
template<class Iterator>
auto find_median(Iterator alpha, Iterator omega)
{
using value_type = typename Iterator::value_type;
if (alpha == omega)
return std::numeric_limits<value_type>::signaling_NaN();
}
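For readers who want to compare notes after trying the exercise, here is one possible completion. It is a sketch, not the only way to do it: it assumes random-access iterators over a floating-point value type, reorders the range in place, and returns a quiet NaN for an empty range. The return type is spelled out through std::iterator_traits instead of auto so that it also compiles as C++11 and works with raw pointers.

```cpp
#include <algorithm>
#include <iterator>
#include <limits>

// Median of [alpha, omega). The range is partially reordered in place.
template <class Iterator>
typename std::iterator_traits<Iterator>::value_type
find_median(Iterator alpha, Iterator omega)
{
    using value_type = typename std::iterator_traits<Iterator>::value_type;
    if (alpha == omega)
        return std::numeric_limits<value_type>::quiet_NaN();
    const auto n = std::distance(alpha, omega);
    // The two middle positions (identical when n is odd)
    const Iterator i1 = alpha + (n - 1) / 2;
    const Iterator i2 = alpha + n / 2;
    std::nth_element(alpha, i1, omega);
    std::nth_element(i1, i2, omega);
    return static_cast<value_type>(0.5 * (*i1 + *i2));
}
```

Called as find_median(v.begin(), v.end()) on a std::vector<double> v, it follows the same two-step nth_element approach as the vector version above.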

Eigen efficient inverse of symmetric positive definite matrix

In Eigen, if we have symmetric positive definite matrix A then we can calculate the inverse of A by
A.inverse();
or
A.llt().solve(I);
where I is an identity matrix of the same size as A. But is there a more efficient way to calculate the inverse of symmetric positive definite matrix?
For example if we write the Cholesky decomposition of A as A = LL^{T}, then L^{-T} L^{-1} is an inverse of A since A L^{-T} L^{-1} = LL^{T} L^{-T} L^{-1} = I (and where L^{-T} denotes the inverse of the transpose of L).
So we could obtain the Cholesky decomposition of A, calculate its inverse, and then obtain the cross-product of that inverse to find the inverse of A. But my instinct is that calculating these explicit steps will be slower than using A.llt().solve(I) as above.
And before anybody asks, I do indeed need an explicit inverse - it is a calculation for part of a Gibbs sampler.
With A.llt().solve(I), you assume A is an SPD matrix and apply the Cholesky decomposition to solve the equation AX = I. The mathematical procedure for solving the equation is exactly the same as your explicit approach, so the performance should be the same if you do every step correctly.
On the other hand, with A.inverse() you are doing a general matrix inversion, which uses LU decomposition for large matrices. Thus it should be slower than A.llt().solve(I).
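To make the explicit route from the question concrete (factor A = LL^T, invert L, then form L^{-T} L^{-1}), here is a minimal plain C++ sketch that does not use Eigen. The names Mat, cholesky, invert_lower and spd_inverse are hypothetical, the input is assumed to be symmetric positive definite, and no attempt is made at blocking or vectorization, so this illustrates the arithmetic and says nothing about performance.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Lower-triangular Cholesky factor L with A = L * L^T (A assumed SPD).
Mat cholesky(const Mat& A) {
    const std::size_t n = A.size();
    Mat L(n, std::vector<double>(n, 0.0));
    for (std::size_t j = 0; j < n; ++j) {
        double d = A[j][j];
        for (std::size_t k = 0; k < j; ++k) d -= L[j][k] * L[j][k];
        L[j][j] = std::sqrt(d);
        for (std::size_t i = j + 1; i < n; ++i) {
            double s = A[i][j];
            for (std::size_t k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            L[i][j] = s / L[j][j];
        }
    }
    return L;
}

// Inverse of a lower-triangular matrix, one column at a time,
// by forward substitution on L * X = I.
Mat invert_lower(const Mat& L) {
    const std::size_t n = L.size();
    Mat X(n, std::vector<double>(n, 0.0));
    for (std::size_t c = 0; c < n; ++c) {
        for (std::size_t i = c; i < n; ++i) {
            double s = (i == c) ? 1.0 : 0.0;
            for (std::size_t k = c; k < i; ++k) s -= L[i][k] * X[k][c];
            X[i][c] = s / L[i][i];
        }
    }
    return X;
}

// A^{-1} = L^{-T} * L^{-1}
Mat spd_inverse(const Mat& A) {
    const Mat Linv = invert_lower(cholesky(A));
    const std::size_t n = A.size();
    Mat Ainv(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                Ainv[i][j] += Linv[k][i] * Linv[k][j];
    return Ainv;
}
```

For A = [[4, 2], [2, 3]] this gives L = [[2, 0], [1, sqrt(2)]] and the inverse [[0.375, -0.25], [-0.25, 0.5]], which matches the closed-form 2x2 inverse.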
You should profile the code for your specific problem to get the best answer. I was benchmarking code while trying to evaluate the viability of both approaches using the googletest library and this repo:
#include <gtest/gtest.h>
#define private public
#define protected public
#include <kalman/Matrix.hpp>
#include <Eigen/Cholesky>
#include <chrono>
#include <iostream>
using namespace Kalman;
using namespace std::chrono;
typedef float T;
typedef high_resolution_clock Clock;
TEST(Cholesky, inverseTiming) {
Matrix<T, Dynamic, Dynamic> L;
Matrix<T, Dynamic, Dynamic> S;
Matrix<T, Dynamic, Dynamic> Sinv_method1;
Matrix<T, Dynamic, Dynamic> Sinv_method2;
int Nmin = 2;
int Nmax = 128;
int N(Nmin);
while (N <= Nmax) {
L.resize(N, N);
L.setRandom();
S.resize(N, N);
// create a random NxN SPD matrix
S = L*L.transpose();
std::cout << "\n";
std::cout << "+++++++++++++++++++++++++ N = " << N << " +++++++++++++++++++++++++++++++++++++++" << std::endl;
auto t1 = Clock::now();
Sinv_method1.resize(N, N);
Sinv_method1 = S.inverse();
auto dt1 = Clock::now() - t1;
std::cout << "Method 1 took " << duration_cast<microseconds>(dt1).count() << " usec" << std::endl;
auto t2 = Clock::now();
Sinv_method2.resize(N, N);
Sinv_method2 = S.llt().solve(Matrix<T, Dynamic, Dynamic>::Identity(N, N));
auto dt2 = Clock::now() - t2;
std::cout << "Method 2 took " << duration_cast<microseconds>(dt2).count() << " usec" << std::endl;
for(int i = 0; i < N; i++)
{
for(int j = 0; j < N; j++)
{
EXPECT_NEAR( Sinv_method1(i, j), Sinv_method2(i, j), 1e-3 );
}
}
N *= 2;
std::cout << "+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++" << std::endl;
std::cout << "\n";
}
}
What the above example showed me was that, for my problem size, the speedup from method 2 was negligible, whereas the loss of accuracy (taking the .inverse() call as the reference) was noticeable.

thrust vector distance calculation

Consider the following dataset and centroids. There are 7 individuals and two means, each with 8 dimensions. They are stored in row-major order.
short dim = 8;
float centroids[] = {
0.223, 0.002, 0.223, 0.412, 0.334, 0.532, 0.244, 0.612,
0.742, 0.812, 0.817, 0.353, 0.325, 0.452, 0.837, 0.441
};
float data[] = {
0.314, 0.504, 0.030, 0.215, 0.647, 0.045, 0.443, 0.325,
0.731, 0.354, 0.696, 0.604, 0.954, 0.673, 0.625, 0.744,
0.615, 0.936, 0.045, 0.779, 0.169, 0.589, 0.303, 0.869,
0.275, 0.406, 0.003, 0.763, 0.471, 0.748, 0.230, 0.769,
0.903, 0.489, 0.135, 0.599, 0.094, 0.088, 0.272, 0.719,
0.112, 0.448, 0.809, 0.157, 0.227, 0.978, 0.747, 0.530,
0.908, 0.121, 0.321, 0.911, 0.884, 0.792, 0.658, 0.114
};
I want to calculate every pairwise euclidean distance: c1 - d1, c1 - d2, ...
On CPU I would do:
float dist = 0.0, dist_sqrt;
for(int i = 0; i < 2; i++)
for(int j = 0; j < 7; j++)
{
float dist_sum = 0.0;
for(int k = 0; k < dim; k++)
{
dist = centroids[i * dim + k] - data[j * dim + k];
dist_sum += dist * dist;
}
dist_sqrt = sqrt(dist_sum);
// do something with the distance
std::cout << dist_sqrt << std::endl;
}
Is there any built in solution of vector distance calculation in THRUST?
It can be done in thrust. Explaining how will be rather involved, and the code is rather dense.
The key observation to start with is that the core operation can be done via a transformed reduction. The thrust transform operation is used to perform the elementwise subtraction of the vectors (individual-centroid) and squaring of each result, and the reduction sums the results together to produce the square of the euclidean distance. The starting point for this operation is thrust::reduce_by_key, but it gets rather involved to present the data correctly to reduce_by_key.
The final results are produced by taking the square root of each result from above, and we can use an ordinary thrust::transform for this.
The above is a summary description of the only 2 lines of thrust code that do all the work. However, the first line has considerable complexity to it. In order to exploit parallelism, the approach I took was to virtually "lay out" the necessary vectors in sequence, to be presented to reduce_by_key. To take a simple example, suppose we have 2 centroids and 4 individuals, and suppose our dimension is 2.
centroid 0: C00 C01
centroid 1: C10 C11
individ 0: I00 I01
individ 1: I10 I11
individ 2: I20 I21
individ 3: I30 I31
We can "lay out" the vectors like this:
C00 C01 C00 C01 C00 C01 C00 C01 C10 C11 C10 C11 C10 C11 C10 C11
I00 I01 I10 I11 I20 I21 I30 I31 I00 I01 I10 I11 I20 I21 I30 I31
To facilitate the reduce_by_key, we will also need to generate key values to delineate the vectors:
0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
The above data "laid-out" data sets can be quite large, and we don't want to incur storage and retrieval cost, so we will generate these "on-the-fly" using thrust's collection of fancy iterators. This is where things get quite dense. With the above strategy in mind, we will use thrust::reduce_by_key to do the work. We'll create a custom functor provided to a transform_iterator to do the subtraction (and squaring) of the I and C vectors, which will be zipped together for this purpose. The "lay out" of the vectors will be created on the fly using permutation iterators with additional custom index-creation functors, to help with the replicated patterns in each of I and C.
Therefore, working from the "inside out", the sequence of steps is as follows:
1. for both I (data) and C (centr) use a counting_iterator combined with a custom indexing functor inside of a transform_iterator to produce the indexing sequences we will need.
2. using the indexing sequences created in step 1 and the base I and C vectors, virtually "lay out" the vectors via a permutation_iterator (one for each laid-out vector).
3. zip the 2 "laid out" virtual I and C vectors together, to create a <float, float> tuple vector (virtual).
4. take the zip_iterator from step 3, and combine with a custom distance-calculation functor ((I-C)^2) in a transform_iterator
5. use another transform_iterator, combining a counting_iterator with a custom key-generating functor, to produce the key sequence (virtual)
6. pass the iterators in steps 4 and 5 to reduce_by_key as the inputs (keys, values) to be reduced. The output vectors for reduce_by_key are also keys and values. We don't need the keys, so we'll use a discard_iterator to dump those. The values we will save.
The above steps are all accomplished in a single line of thrust code.
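The index arithmetic behind steps 1 and 5 can be checked on the host before involving the GPU. The sketch below is plain C++ with no Thrust dependency; the free functions centroid_index, data_index and segment_key are hypothetical names that simply copy the formulas of the c_idx, d_idx and dkeygen functors from the listing, and materialize is a helper added here to write a virtual sequence out explicitly.

```cpp
#include <vector>

// Same formula as c_idx: which centroid element is read at position val.
int centroid_index(int val, int dim, int numd) {
    return (val % dim) + dim * (val / (dim * numd));
}

// Same formula as d_idx: which data element is read at position val.
int data_index(int val, int dim, int numd) {
    return val % (dim * numd);
}

// Same formula as dkeygen: one key per group of dim consecutive positions.
int segment_key(int val, int dim) {
    return val / dim;
}

// Write out the first `count` values of a virtual sequence, for inspection.
template <class F>
std::vector<int> materialize(int count, F f) {
    std::vector<int> out;
    out.reserve(count);
    for (int v = 0; v < count; ++v) out.push_back(f(v));
    return out;
}
```

For the small example above (2 centroids, 4 individuals, dimension 2) the 16 positions give centroid indices 0 1 0 1 0 1 0 1 2 3 2 3 2 3 2 3 and data indices 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7, which reproduces the two laid-out rows. The keys come out as 0 0 1 1 2 2 ... 7 7 rather than the alternating pattern drawn earlier; both work, since reduce_by_key only requires that adjacent segments carry different keys.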
Here's a code illustrating the above:
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/copy.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <iterator>
#include <math.h>
#include <time.h>
#include <sys/time.h>
#include <stdlib.h>
#define MAX_DATA 100000000
#define MAX_CENT 5000
#define TOL 0.001
unsigned long long dtime_usec(unsigned long long prev){
#define USECPSEC 1000000ULL
timeval tv1;
gettimeofday(&tv1,0);
return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}
unsigned verify(float *d1, float *d2, int len){
unsigned pass = 1;
for (int i = 0; i < len; i++)
if (fabsf(d1[i] - d2[i]) > TOL){
std::cout << "mismatch at: " << i << " val1: " << d1[i] << " val2: " << d2[i] << std::endl;
pass = 0;
break;}
return pass;
}
void eucl_dist_cpu(const float *centroids, const float *data, float *rdist, int num_centroids, int dim, int num_data, int print){
int out_idx = 0;
float dist, dist_sqrt;
for(int i = 0; i < num_centroids; i++)
for(int j = 0; j < num_data; j++)
{
float dist_sum = 0.0;
for(int k = 0; k < dim; k++)
{
dist = centroids[i * dim + k] - data[j * dim + k];
dist_sum += dist * dist;
}
dist_sqrt = sqrt(dist_sum);
// do something with the distance
rdist[out_idx++] = dist_sqrt;
if (print) std::cout << dist_sqrt << ", ";
}
if (print) std::cout << std::endl;
}
struct dkeygen : public thrust::unary_function<int, int>
{
int dim;
int numd;
dkeygen(const int _dim, const int _numd) : dim(_dim), numd(_numd) {};
__host__ __device__ int operator()(const int val) const {
return (val/dim);
}
};
typedef thrust::tuple<float, float> mytuple;
struct my_dist : public thrust::unary_function<mytuple, float>
{
__host__ __device__ float operator()(const mytuple &my_tuple) const {
float temp = thrust::get<0>(my_tuple) - thrust::get<1>(my_tuple);
return temp*temp;
}
};
struct d_idx : public thrust::unary_function<int, int>
{
int dim;
int numd;
d_idx(int _dim, int _numd) : dim(_dim), numd(_numd) {};
__host__ __device__ int operator()(const int val) const {
return (val % (dim*numd));
}
};
struct c_idx : public thrust::unary_function<int, int>
{
int dim;
int numd;
c_idx(int _dim, int _numd) : dim(_dim), numd(_numd) {};
__host__ __device__ int operator()(const int val) const {
return (val % dim) + (dim * (val/(dim*numd)));
}
};
struct my_sqrt : public thrust::unary_function<float, float>
{
__host__ __device__ float operator()(const float val) const {
return sqrtf(val);
}
};
unsigned long long eucl_dist_thrust(thrust::host_vector<float> &centroids, thrust::host_vector<float> &data, thrust::host_vector<float> &dist, int num_centroids, int dim, int num_data, int print){
thrust::device_vector<float> d_data = data;
thrust::device_vector<float> d_centr = centroids;
thrust::device_vector<float> values_out(num_centroids*num_data);
unsigned long long compute_time = dtime_usec(0);
thrust::reduce_by_key(
// keys: begin and end of the virtual key sequence
thrust::make_transform_iterator(thrust::make_counting_iterator<int>(0), dkeygen(dim, num_data)),
thrust::make_transform_iterator(thrust::make_counting_iterator<int>(dim*num_data*num_centroids), dkeygen(dim, num_data)),
// values: (C - I)^2 over the zipped, virtually laid-out centroid and data vectors
thrust::make_transform_iterator(
thrust::make_zip_iterator(thrust::make_tuple(
thrust::make_permutation_iterator(d_centr.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator<int>(0), c_idx(dim, num_data))),
thrust::make_permutation_iterator(d_data.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator<int>(0), d_idx(dim, num_data))))),
my_dist()),
// outputs: keys are discarded, summed squared distances are kept
thrust::make_discard_iterator(),
values_out.begin());
thrust::transform(values_out.begin(), values_out.end(), values_out.begin(), my_sqrt());
cudaDeviceSynchronize();
compute_time = dtime_usec(compute_time);
if (print){
thrust::copy(values_out.begin(), values_out.end(), std::ostream_iterator<float>(std::cout, ", "));
std::cout << std::endl;
}
thrust::copy(values_out.begin(), values_out.end(), dist.begin());
return compute_time;
}
int main(int argc, char *argv[]){
int dim = 8;
int num_centroids = 2;
float centroids[] = {
0.223, 0.002, 0.223, 0.412, 0.334, 0.532, 0.244, 0.612,
0.742, 0.812, 0.817, 0.353, 0.325, 0.452, 0.837, 0.441
};
int num_data = 8;
float data[] = {
0.314, 0.504, 0.030, 0.215, 0.647, 0.045, 0.443, 0.325,
0.731, 0.354, 0.696, 0.604, 0.954, 0.673, 0.625, 0.744,
0.615, 0.936, 0.045, 0.779, 0.169, 0.589, 0.303, 0.869,
0.275, 0.406, 0.003, 0.763, 0.471, 0.748, 0.230, 0.769,
0.903, 0.489, 0.135, 0.599, 0.094, 0.088, 0.272, 0.719,
0.112, 0.448, 0.809, 0.157, 0.227, 0.978, 0.747, 0.530,
0.908, 0.121, 0.321, 0.911, 0.884, 0.792, 0.658, 0.114,
0.721, 0.555, 0.979, 0.412, 0.007, 0.501, 0.844, 0.234
};
std::cout << "cpu results: " << std::endl;
float dist[num_data*num_centroids];
eucl_dist_cpu(centroids, data, dist, num_centroids, dim, num_data, 1);
thrust::host_vector<float> h_data(data, data + (sizeof(data)/sizeof(float)));
thrust::host_vector<float> h_centr(centroids, centroids + (sizeof(centroids)/sizeof(float)));
thrust::host_vector<float> h_dist(num_centroids*num_data);
std::cout << "gpu results: " << std::endl;
eucl_dist_thrust(h_centr, h_data, h_dist, num_centroids, dim, num_data, 1);
float *data2, *centroids2, *dist2;
num_centroids = 10;
num_data = 1000000;
if (argc > 2) {
num_centroids = atoi(argv[1]);
num_data = atoi(argv[2]);
if ((num_centroids < 1) || (num_centroids > MAX_CENT)) {std::cout << "Num centroids out of range" << std::endl; return 1;}
if ((num_data < 1) || (num_data > MAX_DATA)) {std::cout << "Num data out of range" << std::endl; return 1;}
if (num_data * dim * num_centroids > 2000000000) {std::cout << "data set out of range" << std::endl; return 1;}}
std::cout << "Num Data: " << num_data << std::endl;
std::cout << "Num Cent: " << num_centroids << std::endl;
std::cout << "result size: " << ((num_data*num_centroids*4)/1048576) << " Mbytes" << std::endl;
data2 = new float[dim*num_data];
centroids2 = new float[dim*num_centroids];
dist2 = new float[num_data*num_centroids];
for (int i = 0; i < dim*num_data; i++) data2[i] = rand()/(float)RAND_MAX;
for (int i = 0; i < dim*num_centroids; i++) centroids2[i] = rand()/(float)RAND_MAX;
unsigned long long dtime = dtime_usec(0);
eucl_dist_cpu(centroids2, data2, dist2, num_centroids, dim, num_data, 0);
dtime = dtime_usec(dtime);
std::cout << "cpu time: " << dtime/(float)USECPSEC << "s" << std::endl;
thrust::host_vector<float> h_data2(data2, data2 + (dim*num_data));
thrust::host_vector<float> h_centr2(centroids2, centroids2 + (dim*num_centroids));
thrust::host_vector<float> h_dist2(num_data*num_centroids);
dtime = dtime_usec(0);
unsigned long long ctime = eucl_dist_thrust(h_centr2, h_data2, h_dist2, num_centroids, dim, num_data, 0);
dtime = dtime_usec(dtime);
std::cout << "gpu total time: " << dtime/(float)USECPSEC << "s, gpu compute time: " << ctime/(float)USECPSEC << "s" << std::endl;
if (!verify(dist2, &(h_dist2[0]), num_data*num_centroids)) {std::cout << "Verification failure." << std::endl; return 1;}
std::cout << "Success!" << std::endl;
return 0;
}
Notes:
The code is set up to do 2 passes, a short one using a data set similar to yours, with printout for visual check. Then a larger data set can be entered, via command-line sizing parameters (number of centroids, then number of individuals), for benchmark comparison and validation of results.
Contrary to what I stated in the comments, the thrust code is only running about 25% faster than the naive single-threaded CPU code. Your mileage may vary.
This is just one way to think about handling it. I have had other ideas, but not enough time to flesh them out.
The data sets can become rather large. The code right now is intended to be limited to data sets where the product of dimension*number_of_centroids*number_of_individuals is less than 2 billion. However, as you approach even this number, you will need a GPU and CPU that both have a few GB of memory. I briefly explored larger data set sizes. A few code changes would be needed in various places to extend from e.g. int to unsigned long long, etc. However I haven't provided that as I am still investigating an issue with that code.
For another, non-thrust-related look at computing euclidean distances on the GPU, you may be interested in this question. If you follow the sequence of optimizations that were made there, it may shed some light on either how this thrust code might be improved, or else how another non-thrust realization could be used.
Sorry I wasn't able to squeeze more performance out.