I have written a C++ OpenMP matrix multiplication program that multiplies two 1000x1000 matrices.
So far I have gotten a 0.700 sec execution time using OpenMP, but I want to see if there are other ways I can make it faster with OpenMP.
I appreciate any advice or tips you can give me.
Here is my code:
#include <iostream>
#include <time.h>
#include <omp.h>
using namespace std;

void Multiply()
{
    // initialize matrices with random numbers
    int aMatrix[1000][1000], i, j;
    for (i = 0; i < 1000; ++i) {
        for (j = 0; j < 1000; ++j) {
            aMatrix[i][j] = rand();
        }
    }
    int bMatrix[1000][1000], i1, j2;
    for (i1 = 0; i1 < 1000; ++i1) {
        for (j2 = 0; j2 < 1000; ++j2) {
            bMatrix[i1][j2] = rand();
        }
    }
    // Result Matrix
    int product[1000][1000] = {0};
    #pragma omp parallel for
    for (int row = 0; row < 1000; row++) {
        for (int col = 0; col < 1000; col++) {
            // Multiply the row of A by the column of B to get the row, column of product.
            for (int inner = 0; inner < 1000; inner++) {
                product[row][col] += aMatrix[row][inner] * bMatrix[inner][col];
            }
        }
    }
}

int main() {
    time_t begin, end;
    time(&begin);
    Multiply();
    time(&end);
    time_t elapsed = end - begin;
    cout << "Time measured: " << elapsed << " seconds.\n";
    return 0;
}
The following things can be done for a speedup:
1. Use OpenMP to parallelize the outer loop, as you did (and as I also do in my code below). Alternatively, use std::async for multi-threading, as is done in another answer.
2. Transpose the B matrix. This helps increase L1 cache hits, because you then read each B column (each row of the transposed matrix) from sequential memory. (A short sketch combining this with point 1 follows after this list.)
3. Use vectorized SIMD instructions. These allow several multiplications (and additions) to be done within one CPU cycle. Compilers often auto-vectorize your loops well enough through SIMD instructions without your help, but I did it explicitly in my code.
4. Run several independent SIMD instructions within the loop. This helps keep the CPU's whole SIMD pipeline occupied. I did so in my code by using four SIMD registers, r0, r1, r2, r3. In compiler terms this is usually called loop unrolling.
5. Align your matrix's starting address on a 64-byte boundary. This helps SIMD instructions do fast aligned reads/writes.
6. Align the starting address of each matrix row on a 64-byte boundary. I did this in my code by padding each row with zeros up to a multiple of 64 bytes. This also helps SIMD instructions do fast aligned reads/writes.
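As a minimal illustration of points 1 and 2 alone (OpenMP on the outer loop plus a transposed B), here is a sketch without SIMD, unrolling or alignment; the flat std::vector storage and the function name are assumptions made just for this example:

#include <vector>

// Sketch: OpenMP on the outer loop plus a transposed copy of B, so both operands
// of the inner loop walk contiguous memory. C must already be sized to N*N.
void MultiplyTransposed(const std::vector<int>& A, const std::vector<int>& B,
                        std::vector<int>& C, int N)
{
    std::vector<int> Bt(size_t(N) * N);
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            Bt[size_t(j) * N + i] = B[size_t(i) * N + j];

    #pragma omp parallel for
    for (int row = 0; row < N; ++row) {
        for (int col = 0; col < N; ++col) {
            int sum = 0;
            for (int k = 0; k < N; ++k)
                sum += A[size_t(row) * N + k] * Bt[size_t(col) * N + k];
            C[size_t(row) * N + col] = sum;
        }
    }
}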
In my code below I applied all of steps 1-6 above. The 64-byte memory alignment is done through an AlignmentAllocator used with std::vector. I also did time measurements for float/double/int.
On my old 4-core laptop I got the following time measurements for multiplying a 1000x1000 matrix by a 1000x1000 matrix:
float: time 0.1569 sec
double: time 0.3168 sec
int: time 0.1565 sec
To compare my hardware's capabilities, I also measured the other answer by @doug for the int case:
Threads w transpose 0.2164 secs.
As one can see, my solution is about 1.4x faster than the other answer, I guess due to the 64-byte memory alignment and maybe due to using explicit SIMD (instead of relying on the compiler's auto-vectorization of a loop).
To compile my program under GCC/Clang, don't forget to add the -fopenmp -lgomp options (for OpenMP support) and -march=native -O3 -std=c++20 (for SIMD support, optimizations and the standard). MSVC, I guess, adds OpenMP automatically and doesn't need any special options (use /O2 /GL /std:c++latest for optimizations and the standard in MSVC).
In my code I only implemented SSE2/SSE4/AVX/AVX2 instructions for SIMD. If you have a more powerful machine, you may tell me and I can also implement FMA/AVX-512; they would give up to another 2x speed boost.
My Mul() function is quite generic: it is templated, and you just pass pointers to the matrices plus their row/column counts, so on the calling side your matrices may be stored in any way (std::vector, std::array, or a plain 2D array).
At the start of the Run() function you may change the number of rows and columns if you need a bigger test. Note that all my functions support arbitrary row and column counts; you may even multiply a 1234x2345 matrix by a 2345x3456 one.
Try it online!
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <iostream>
#include <iomanip>
#include <vector>
#include <memory>
#include <string>
#include <immintrin.h>
#define USE_OPENMP 1
#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg '" + std::string(msg) + "'."); }
#define ASSERT(cond) ASSERT_MSG(cond, "")
#if defined(_MSC_VER)
#define IS_MSVC 1
#else
#define IS_MSVC 0
#endif
#if USE_OPENMP
#include <omp.h>
#endif
template <typename T, std::size_t N>
class AlignmentAllocator {
public:
typedef T value_type;
typedef std::size_t size_type;
typedef std::ptrdiff_t difference_type;
typedef T * pointer;
typedef const T * const_pointer;
typedef T & reference;
typedef const T & const_reference;
public:
inline AlignmentAllocator() throw() {}
template <typename T2> inline AlignmentAllocator(const AlignmentAllocator<T2, N> &) throw() {}
inline ~AlignmentAllocator() throw() {}
inline pointer adress(reference r) { return &r; }
inline const_pointer adress(const_reference r) const { return &r; }
inline pointer allocate(size_type n);
inline void deallocate(pointer p, size_type);
inline void construct(pointer p, const value_type & wert);
inline void destroy(pointer p) { p->~value_type(); }
inline size_type max_size() const throw() { return size_type(-1) / sizeof(value_type); }
template <typename T2> struct rebind { typedef AlignmentAllocator<T2, N> other; };
bool operator!=(const AlignmentAllocator<T, N> & other) const { return !(*this == other); }
bool operator==(const AlignmentAllocator<T, N> & other) const { return true; }
};
template <typename T, std::size_t N>
inline typename AlignmentAllocator<T, N>::pointer AlignmentAllocator<T, N>::allocate(size_type n) {
#if IS_MSVC
auto p = (pointer)_aligned_malloc(n * sizeof(value_type), N);
#else
auto p = (pointer)std::aligned_alloc(N, n * sizeof(value_type));
#endif
ASSERT(p);
return p;
}
template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::deallocate(pointer p, size_type) {
#if IS_MSVC
_aligned_free(p);
#else
std::free(p);
#endif
}
template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::construct(pointer p, const value_type & wert) {
new (p) value_type(wert);
}
template <typename T>
using AlignedVector = std::vector<T, AlignmentAllocator<T, 64>>;
template <typename T>
struct RegT;
#ifdef __AVX__
template <> struct RegT<float> { static size_t constexpr bisize = 256; using type = __m256; static type zero() { return _mm256_setzero_ps(); } };
template <> struct RegT<double> { static size_t constexpr bisize = 256; using type = __m256d; static type zero() { return _mm256_setzero_pd(); } };
inline void MulAddReg(float const * a, float const * b, __m256 & c) {
c = _mm256_add_ps(c, _mm256_mul_ps(_mm256_load_ps(a), _mm256_load_ps(b)));
}
inline void MulAddReg(double const * a, double const * b, __m256d & c) {
c = _mm256_add_pd(c, _mm256_mul_pd(_mm256_load_pd(a), _mm256_load_pd(b)));
}
inline void StoreReg(float * dst, __m256 const & src) { _mm256_store_ps(dst, src); }
inline void StoreReg(double * dst, __m256d const & src) { _mm256_store_pd(dst, src); }
#else // SSE2
template <> struct RegT<float> { static size_t constexpr bisize = 128; using type = __m128; static type zero() { return _mm_setzero_ps(); } };
template <> struct RegT<double> { static size_t constexpr bisize = 128; using type = __m128d; static type zero() { return _mm_setzero_pd(); } };
inline void MulAddReg(float const * a, float const * b, __m128 & c) {
c = _mm_add_ps(c, _mm_mul_ps(_mm_load_ps(a), _mm_load_ps(b)));
}
inline void MulAddReg(double const * a, double const * b, __m128d & c) {
c = _mm_add_pd(c, _mm_mul_pd(_mm_load_pd(a), _mm_load_pd(b)));
}
inline void StoreReg(float * dst, __m128 const & src) { _mm_store_ps(dst, src); }
inline void StoreReg(double * dst, __m128d const & src) { _mm_store_pd(dst, src); }
#endif
#ifdef __AVX2__
template <> struct RegT<int32_t> { static size_t constexpr bisize = 256; using type = __m256i; static type zero() { return _mm256_setzero_si256(); } };
//template <> struct RegT<int64_t> { static size_t constexpr bisize = 256; using type = __m256i; static type zero() { return _mm256_setzero_si256(); } };
inline void MulAddReg(int32_t const * a, int32_t const * b, __m256i & c) {
c = _mm256_add_epi32(c, _mm256_mullo_epi32(_mm256_load_si256((__m256i*)a), _mm256_load_si256((__m256i*)b)));
}
//inline void MulAddReg(int64_t const * a, int64_t const * b, __m256i & c) {
// c = _mm256_add_epi64(c, _mm256_mullo_epi64(_mm256_load_si256((__m256i*)a), _mm256_load_si256((__m256i*)b)));
//}
inline void StoreReg(int32_t * dst, __m256i const & src) { _mm256_store_si256((__m256i*)dst, src); }
//inline void StoreReg(int64_t * dst, __m256i const & src) { _mm256_store_si256((__m256i*)dst, src); }
#else // SSE2
template <> struct RegT<int32_t> { static size_t constexpr bisize = 128; using type = __m128i; static type zero() { return _mm_setzero_si128(); } };
//template <> struct RegT<int64_t> { static size_t constexpr bisize = 128; using type = __m128i; static type zero() { return _mm_setzero_si128(); } };
inline void MulAddReg(int32_t const * a, int32_t const * b, __m128i & c) {
c = _mm_add_epi32(c, _mm_mullo_epi32(_mm_load_si128((__m128i*)a), _mm_load_si128((__m128i*)b)));
}
//inline void MulAddReg(int64_t const * a, int64_t const * b, __m128i & c) {
// c = _mm_add_epi64(c, _mm_mullo_epi64(_mm_load_si128((__m128i*)a), _mm_load_si128((__m128i*)b)));
//}
inline void StoreReg(int32_t * dst, __m128i const & src) { _mm_store_si128((__m128i*)dst, src); }
//inline void StoreReg(int64_t * dst, __m128i const & src) { _mm_store_si128((__m128i*)dst, src); }
#endif
template <typename T>
void Mul(T const * A0, size_t A_rows, size_t A_cols, T const * B0, size_t B_rows, size_t B_cols, T * C) {
size_t constexpr reg_cnt = RegT<T>::bisize / 8 / sizeof(T), block = 4 * reg_cnt;
ASSERT(A_cols == B_rows);
size_t const A_cols_aligned = (A_cols + block - 1) / block * block, B_rows_aligned = (B_rows + block - 1) / block * block;
// Copy aligned A
AlignedVector<T> Av(A_rows * A_cols_aligned);
for (size_t i = 0; i < A_rows; ++i)
std::memcpy(&Av[i * A_cols_aligned], &A0[i * A_cols], sizeof(Av[0]) * A_cols);
T const * A = Av.data();
// Transpose B
AlignedVector<T> Bv(B_cols * B_rows_aligned);
for (size_t j = 0; j < B_cols; ++j)
for (size_t i = 0; i < B_rows; ++i)
Bv[j * B_rows_aligned + i] = B0[i * B_cols + j];
T const * Bt = Bv.data();
ASSERT(uintptr_t(A) % 64 == 0 && uintptr_t(Bt) % 64 == 0);
ASSERT(uintptr_t(&A[A_cols_aligned]) % 64 == 0 && uintptr_t(&Bt[B_rows_aligned]) % 64 == 0);
// Multiply
#pragma omp parallel for
for (size_t i = 0; i < A_rows; ++i) {
// Aligned Reg storage
AlignedVector<T> Regs(block);
for (size_t j = 0; j < B_cols; ++j) {
T const * Arow = &A[i * A_cols_aligned + 0], * Btrow = &Bt[j * B_rows_aligned + 0];
using Reg = typename RegT<T>::type;
Reg r0 = RegT<T>::zero(), r1 = RegT<T>::zero(), r2 = RegT<T>::zero(), r3 = RegT<T>::zero();
size_t const k_hi = A_cols - A_cols % block;
for (size_t k = 0; k < k_hi; k += block) {
MulAddReg(&Arow[k + reg_cnt * 0], &Btrow[k + reg_cnt * 0], r0);
MulAddReg(&Arow[k + reg_cnt * 1], &Btrow[k + reg_cnt * 1], r1);
MulAddReg(&Arow[k + reg_cnt * 2], &Btrow[k + reg_cnt * 2], r2);
MulAddReg(&Arow[k + reg_cnt * 3], &Btrow[k + reg_cnt * 3], r3);
}
StoreReg(&Regs[reg_cnt * 0], r0);
StoreReg(&Regs[reg_cnt * 1], r1);
StoreReg(&Regs[reg_cnt * 2], r2);
StoreReg(&Regs[reg_cnt * 3], r3);
T sum1 = 0, sum2 = 0, sum3 = 0;
for (size_t k = 0; k < Regs.size(); ++k)
sum1 += Regs[k];
//for (size_t k = 0; k < A_cols - A_cols % block; ++k) sum3 += Arow[k] * Btrow[k];
for (size_t k = k_hi; k < A_cols; ++k)
sum2 += Arow[k] * Btrow[k];
C[i * B_cols + j] = sum2 + sum1; // C is A_rows x B_cols, row-major
}
}
}
#include <random>
#include <thread>
#include <chrono>
#include <type_traits>
template <typename T>
void Test(T const * A, size_t A_rows, size_t A_cols, T const * B, size_t B_rows, size_t B_cols, T const * C, T eps) {
for (size_t i = 0; i < A_rows / 16; ++i)
for (size_t j = 0; j < B_cols / 16; ++j) {
T sum = 0;
for (size_t k = 0; k < A_cols; ++k)
sum += A[i * A_cols + k] * B[k * B_cols + j];
ASSERT_MSG(std::abs(C[i * B_cols + j] - sum) <= eps * A_cols, "i " + std::to_string(i) + " j " + std::to_string(j) +
" C " + std::to_string(C[i * B_cols + j]) + " ref " + std::to_string(sum));
}
}
double Time() {
static auto const gtb = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::duration<double>>(
std::chrono::high_resolution_clock::now() - gtb).count();
}
template <typename T>
void Run() {
size_t constexpr A_rows = 1000, A_cols = 1000, B_rows = 1000, B_cols = 1000;
std::string const tname = std::is_same_v<T, float> ? "float" : std::is_same_v<T, double> ?
"double" : std::is_same_v<T, int32_t> ? "int" : "<unknown>";
bool const is_int = tname == "int";
std::mt19937_64 rng{123};
std::vector<T> A(A_rows * A_cols), B(B_rows * B_cols), C(A_rows * B_cols);
for (size_t i = 0; i < A.size(); ++i)
A[i] = is_int ? (int64_t(rng() % (1 << 11)) - (1 << 10)) : (T(int64_t(rng() % (1 << 28)) - (1 << 27)) / T(1 << 27));
for (size_t i = 0; i < B.size(); ++i)
B[i] = is_int ? (int64_t(rng() % (1 << 11)) - (1 << 10)) : (T(int64_t(rng() % (1 << 28)) - (1 << 27)) / T(1 << 27));
auto tim = Time();
Mul(&A[0], A_rows, A_cols, &B[0], B_rows, B_cols, &C[0]);
tim = Time() - tim;
std::cout << std::setw(6) << tname << ": time " << std::fixed << std::setprecision(4) << tim << " sec" << std::endl;
Test(&A[0], A_rows, A_cols, &B[0], B_rows, B_cols, &C[0], tname == "float" ? T(1e-7) : tname == "double" ? T(1e-15) : T(0));
}
int main() {
try {
#if USE_OPENMP
omp_set_num_threads(std::thread::hardware_concurrency());
#endif
Run<float>();
Run<double>();
Run<int32_t>();
return 0;
} catch (std::exception const & ex) {
std::cout << "Exception: " << ex.what() << std::endl;
return -1;
}
}
Output:
float: time 0.1569 sec
double: time 0.3168 sec
int: time 0.1565 sec
Here's straight C++ code that runs in 0.08 s with ints and 0.14 s with floats or doubles. My system is 10 years old with relatively slow memory; it was good at the time, but not by today's standards.
I agree with @VictorEijkhout that the best results would come from tuned code; a huge amount of work has gone into optimizing such routines.
#include <vector>
#include <array>
#include <random>
#include <cassert>
#include <iostream>
#include <iomanip>
#include <thread>
#include <future>
#include <chrono>
struct Timer {
std::chrono::system_clock::time_point snapTime;
Timer() { snapTime = std::chrono::system_clock::now(); }
operator double() { return std::chrono::duration<double>(std::chrono::system_clock::now() - snapTime).count(); }
};
using DataType = int;
using std::array, std::vector;
constexpr int N = 1000, THREADS = 12;
static auto launchType = std::launch::async;
using Matrix = vector<array<DataType, N>>;
Matrix create_matrix() { return Matrix(N); };
Matrix product(Matrix const& v0, Matrix const& v1, double& time)
{
Matrix ret = create_matrix();
Matrix v2 = create_matrix();
Timer timer;
for (size_t r = 0; r < N; r++) // transpose first
for (size_t c = 0; c < N; c++)
v2[c][r] = v1[r][c];
// lambda to process sets of rows in separate threads
auto do_row_set = [&v0, &v2, &ret](size_t start, size_t last) {
for (size_t row = start; row < last; row++)
for (size_t col = 0; col < N; col++)
{
DataType tmp{}; // separate tmp variable significantly improves optimization
for (size_t col_t = 0; col_t < N; col_t++)
tmp += v0[row][col_t] * v2[col][col_t];
ret[row][col] = tmp;
}
};
vector<size_t> seq;
const size_t NN = N / THREADS;
// make a sequence of NN+1 rows from start to end
for (size_t thread_n = 0; thread_n < N; thread_n += NN)
seq.push_back(thread_n);
seq.push_back(N);
vector<std::future<void>> results; results.reserve(seq.size() - 1);
for (size_t i = 0; i + 1 < seq.size(); i++) // cover every range; the last range ends at N even when N % THREADS != 0
results.emplace_back(std::async(launchType, do_row_set, seq[i], seq[i + 1]));
for (auto& x : results)
x.get();
time = timer;
return ret;
}
bool operator==(Matrix const& v0, Matrix const& v1)
{
for (size_t r = 0; r < N; r++)
for (size_t c = 0; c < N; c++)
if (v0[r][c] != v1[r][c])
return false;
return true;
}
int main()
{
auto fill = [](Matrix& v) {
std::mt19937_64 r(1);
std::normal_distribution dist(1.);
for (size_t row = 0; row < N; row++)
for (size_t col = 0; col < N; col++)
v[row][col] = DataType(dist(r));
};
Matrix m1 = create_matrix(), m2 = create_matrix(), m3 = create_matrix();
fill(m1); fill(m2);
auto process_test = [&m1, &m2](Matrix& out) {
const int rpt_count = 4;
double sum = 0;
for (int i = 0; i < rpt_count; i++)
{
double time;
out = product(m1, m2, time);
sum += time / rpt_count;
}
return sum;
};
std::cout << std::fixed << std::setprecision(4);
double time{};
time = process_test(m3);
std::cout << "Threads w transpose " << time << " secs.\n";
}
In performance-sensitive code, I have to perform an affine transformation of a vector:
Y=a*X+b
where Y and X are vectors and a and b are scalars.
As a quick-and-dirty way to improve the speed of the computation, I delegated vectorization to OpenMP's #pragma omp simd directive. Having some spare time, I lately tried to implement it directly using intrinsics, getting more or less the same performance as the OMP solution.
Is there a way to beat the OMP vectorization? I can use instructions up to AVX2.
The code below is tested under windows 10, compiled with VS 2019.
#include <iostream>
#include <armadillo>
#include <chrono>
#include <immintrin.h>
///Computes y=alpha*x+beta
inline void SumAndSetOmp(
arma::Col<double>& y /**< Result*/,
const arma::Col<double>& x /**< Input*/,
const double& alpha /**< Coefficient*/,
const double& beta /**< Offset*/)
{
auto* __restrict lhs = y.memptr();
const auto* __restrict add_rhs = x.memptr();
const auto& n = x.n_elem;
#pragma omp simd
for (arma::uword i = 0; i < n; ++i)
{
lhs[i] = add_rhs[i] * alpha + beta;
}
}
inline void SumAndSetSerial(
arma::Col<double>& y /**< Result*/,
const arma::Col<double>& x /**< Input*/,
const double& alpha /**< Coefficient*/,
const double& beta /**< Offset*/)
{
auto* lhs = y.memptr();
const auto* add_rhs = x.memptr();
const auto& n = x.n_elem;
for (arma::uword i = 0; i < n; ++i)
{
lhs[i] = add_rhs[i] * alpha + beta;
}
}
inline void SumAndSetAVX(arma::Col<double>& y /**< Result*/,
const arma::Col<double>& x /**< Input*/,
const double& alpha /**< Coefficient*/,
const double& beta /**< Offset*/)
{
//Allocate coefficients
const auto alphas = _mm256_set1_pd(alpha);
const auto betas = _mm256_set1_pd(beta);
//Extracting memory addresses
auto* __restrict pos_lhs = y.memptr();
const auto* __restrict pos_rhs = x.memptr();
//Computing sizes
const unsigned int length_array = 4;
const unsigned long long n_aligned = x.n_elem / length_array;
const unsigned int remainder = x.n_elem % length_array;
//Performing AVX instruction
for (unsigned long long i = 0; i < n_aligned; i++) {
const __m256d x_avx = _mm256_loadu_pd(pos_rhs);
const __m256d y_avx = _mm256_fmadd_pd(x_avx, alphas, betas);
_mm256_storeu_pd(pos_lhs, y_avx);
pos_rhs += length_array;
pos_lhs += length_array;
}
//Process the rest serially
for (unsigned int i = 0; i < remainder; i++) {
pos_lhs[i] = alpha * pos_rhs[i] + beta;
}
}
enum method
{
serial,
omp,
avx
};
arma::vec perform_test(const arma::vec& x, const method mtd, int trials = 100, const double alpha = 3.0, const double beta = 5.0)
{
arma::Col<double> res(x.n_elem);
const auto beg = std::chrono::steady_clock::now();
switch (mtd) {
case serial:
for (int i = 0; i < trials; i++)
SumAndSetSerial(res, x, alpha, beta);
break;
case omp:
for (int i = 0; i < trials; i++)
SumAndSetOmp(res, x, alpha, beta);
break;
case avx:
for (int i = 0; i < trials; i++)
SumAndSetAVX(res, x, alpha, beta);
break;
}
std::cout << "time:" << std::chrono::duration<double>(std::chrono::steady_clock::now() - beg).count() << "s\n";
return res;
}
//Benchmarking
double test_fun(long long int n,int trials=100, const double alpha = 3.0, const double beta = 5.0)
{
const arma::Col<double> x(n, arma::fill::randn);
const arma::Col<double> reference = alpha*x + beta;
std::cout << "Serial: ";
const auto res_serial = perform_test(x, method::serial, trials, alpha, beta);
std::cout << "OMP: ";
const auto res_omp = perform_test(x, method::omp, trials, alpha, beta);
std::cout << "AVX: ";
const auto res_avx = perform_test(x, method::avx, trials, alpha, beta);
// errors wrt the reference
const double err_serial = arma::max(arma::abs(reference - res_serial));
const double err_avx = arma::max(arma::abs(reference - res_avx));
const double err_omp = arma::max(arma::abs(reference - res_omp));
//Largest error
const double error = std::max(std::max(err_serial, err_avx), err_omp);
if (error> 1e-6)
{
throw std::runtime_error("Something is wrong!");
}
return error;
}
int main()
{
test_fun(10000000);
}
I had a go at improving on your solutions, but I think they are already optimal, so you're best off sticking with OMP.
The following is speculative and goes beyond your question:
If it's an option, you could try OMP's multithreading. I'm always surprised at how low the element count can be before you get a reasonable boost, though 1000 elements with a simple affine transform would, I expect, be too low. If other parts of your algorithm are parallelisable, then this is more likely to be helpful.
If you can afford to change your problem and don't need double precision, you could work with floats.
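Regarding the multithreading suggestion above, a minimal sketch of combining threads and SIMD on this operation might look like the following, assuming a compiler with OpenMP 4.0+ support (on MSVC, #pragma omp simd needs /openmp:experimental); the function name and raw-pointer interface are illustrative only:

#include <cstddef>

// Sketch only: thread the loop across cores and vectorize within each thread.
void SumAndSetParallel(double* __restrict y, const double* __restrict x,
                       double alpha, double beta, std::ptrdiff_t n)
{
    #pragma omp parallel for simd
    for (std::ptrdiff_t i = 0; i < n; ++i)
        y[i] = x[i] * alpha + beta;
}

Whether this pays off depends heavily on the vector length; for short vectors the thread start-up cost dominates this memory-bound operation.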
I have this piece of code that finds the closest pair of points in Euclidean 3D space. This question is not about the algorithm or its implementation. The problem is that it runs significantly slower when compiled with GCC than with Clang. Most confusingly, it has comparable execution times on random samples but is about 100 times slower on one specific sample.
I suspect there may be a bug in GCC, as I cannot think of any other explanation.
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <algorithm>
#include <cmath>
#include <vector>
#include <set>
#include <map>
#include <unordered_set>
#include <unordered_map>
#include <queue>
#include <ctime>
#include <fstream>
#include <cassert>
#include <complex>
#include <string>
#include <cstring>
#include <chrono>
#include <random>
#include <queue>
static std::mt19937 mmtw(std::chrono::steady_clock::now().time_since_epoch().count());
int64_t rng(int64_t x, int64_t y) {
static std::uniform_int_distribution<int64_t> d;
return d(mmtw) % (y - x + 1) + x;
}
constexpr static int MAXN = 1e5 + 10;
void solve(std::istream &in, std::ostream &out);
void generate(std::ostream &out) {
constexpr int N = 1e5;
out << N << '\n';
int MIN = -1e6;
int MAX = 1e6;
for (int i = 0; i < N; ++i) {
out << 0 << ' ';
out << i << ' ';
out << (i + 1) * int(1e4) << '\n';
}
}
int main() {
freopen("input.txt", "r", stdin);
std::ios_base::sync_with_stdio(false);
std::cin.tie(nullptr);
std::cout.tie(nullptr);
std::cerr.tie(nullptr);
std::ofstream fout("input.txt");
generate(fout);
fout.close();
solve(std::cin, std::cout);
return 0;
}
struct point_t {
int32_t x, y, z;
int id;
point_t() = default;
point_t(int32_t x, int32_t y, int32_t z) : x(x), y(y), z(z) {}
point_t operator +(const point_t &rhs) const {
return point_t(x + rhs.x, y + rhs.y, z + rhs.z);
}
point_t operator -(const point_t &rhs) const {
return point_t(x - rhs.x, y - rhs.y, z - rhs.z);
}
int64_t abs2() const {
return 1LL * x * x + 1LL * y * y + 1LL * z * z;
}
};
std::istream &operator >>(std::istream &in, point_t &pt) {
return in >> pt.x >> pt.y >> pt.z;
}
inline bool cmp_x(const point_t &lhs, const point_t &rhs) {
return lhs.x < rhs.x;
}
inline bool cmp_y(const point_t &lhs, const point_t &rhs) {
return lhs.y < rhs.y;
}
inline bool cmp_z(const point_t &lhs, const point_t &rhs) {
return lhs.z < rhs.z;
}
struct pair_t {
int64_t distance_sq;
point_t a {}, b {};
pair_t() : distance_sq(std::numeric_limits<int64_t>::max()) {};
pair_t(const point_t &a, const point_t &b) : distance_sq((a - b).abs2()), a(a), b(b) {}
bool operator<(const pair_t &rhs) const {
return distance_sq < rhs.distance_sq;
}
};
template <typename T> inline T sqr(T arg) { return arg * arg; }
point_t pts[MAXN];
static pair_t ans = pair_t();
void recur_2D(point_t pts[], int size, int64_t threshold_sq) {
if (size <= 3) {
for (int i = 0; i < size; ++i) {
for (int j = i + 1; j < size; ++j) {
ans = std::min(ans, pair_t(pts[i], pts[j]));
}
}
std::sort(pts, pts + size, cmp_y);
return;
}
int mid = size / 2;
int midx = pts[mid].x;
recur_2D(pts, mid, threshold_sq);
recur_2D(pts + mid, size - mid, threshold_sq);
static point_t buffer[MAXN];
std::merge(pts, pts + mid, pts + mid, pts + size, buffer, cmp_y);
std::copy(buffer, buffer + size, pts);
int buff_sz = 0;
for (int i = 0; i < size; ++i) {
if (sqr(pts[i].x - midx) >= threshold_sq) {
continue;
}
int64_t x_sqr = sqr(pts[i].x - midx);
for (int j = buff_sz - 1; j >= 0; --j) {
if (sqr(pts[i].y - buffer[j].y) + x_sqr >= threshold_sq) {
break;
}
ans = std::min(ans, pair_t(pts[i], buffer[j]));
}
buffer[buff_sz++] = pts[i];
}
}
void recur_3D(point_t pts[], int size) {
if (size <= 3) {
for (int i = 0; i < size; ++i) {
for (int j = i + 1; j < size; ++j) {
ans = std::min(ans, pair_t(pts[i], pts[j]));
}
}
std::sort(pts, pts + size, cmp_x);
return;
}
int mid = size / 2;
int midz = pts[mid].z;
recur_3D(pts, mid);
recur_3D(pts + mid, size - mid);
static point_t buffer[MAXN];
std::merge(pts, pts + mid, pts + mid, pts + size, buffer, cmp_x);
std::copy(buffer, buffer + size, pts);
int buff_sz = 0;
for (int i = 0; i < size; ++i) {
if (sqr(pts[i].z - midz) >= ans.distance_sq) {
continue;
}
buffer[buff_sz++] = pts[i];
}
recur_2D(buffer, buff_sz, ans.distance_sq);
}
void solve(std::istream &in, std::ostream &out) {
clock_t start = clock();
int num_of_points;
in >> num_of_points;
for (int i = 0; i < num_of_points; ++i) {
in >> pts[i];
pts[i].id = i + 1;
}
std::sort(pts, pts + num_of_points, cmp_z);
recur_3D(pts, num_of_points);
out << ans.distance_sq << '\n';
out << 1.0 * (clock() - start) / CLOCKS_PER_SEC << " s.\n";
}
Link to this code: https://code.re/2yfPzjkD
It generates the sample that makes the code very slow and then measures the algorithm's execution time.
I compile with
g++ -DLOCAL -std=c++1z -O3 -Wno-everything main.cpp
and with
clang++ -DLOCAL -std=c++1z -O3 -Wno-everything main.cpp
and run
./main while having input.txt in the same directory.
The Clang-compiled binary runs in 0.053798 s, while the GCC one takes 12.4276 s. These numbers are from the program's output; see the solve function.
I have also verified the difference on https://wandbox.org/ on different compiler versions.
https://wandbox.org/permlink/YFEEWSKyos2dQf32 -- clang
https://wandbox.org/permlink/XctarNHvd3I1B0x8 -- gcc
Note that I compressed the input and thus had to change the reading in solve a bit.
On my local machine I have these compilers.
clang++ --version
clang version 7.0.0 (tags/RELEASE_700/final)
g++ --version
g++ (GCC) 8.2.1 20180831
It feels like the GCC binary runs without compiler optimizations. What could be the reason?
UPD.
Also, there is a version that calls std::sort only once in the very beginning.
https://wandbox.org/permlink/i9Kd3GdewxSRwXsM
I have also tried compiling with Clang against -stdlib=libstdc++ and shuffling the data, and I think the different implementations of std::sort are not the cause.
This is simply undefined behavior: your code has signed integer overflow at:
template <typename T> inline T sqr(T arg) { return arg * arg; }
You can replace that with:
template <typename T>
inline T sqr(T arg)
{
assert(double(arg)*arg <= std::numeric_limits<T>::max());
assert(double(arg)*arg >= std::numeric_limits<T>::min());
return arg * arg;
}
and catch the error in a debugger. It fails with arg=-60000 called from recur_3D on the line:
if (sqr(pts[i].z - midz) >= ans.distance_sq) {
this happens with pts[i] = {x = 0, y = 0, z = 10000, id = 1} and midz=70000.
Because this is undefined behavior, all bets are off. Different compilers exploit the assumption that "undefined behavior never happens" in different ways. This is why Clang and GCC perform differently, and it is pure "luck".
Consider using UndefinedBehaviorSanitizer to catch these errors. I don't have it on my clang installation, but clang++ -fsanitize=signed-integer-overflow should do the trick.
Fixing this function gives comparable speed for both Clang and GCC.
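For completeness, a minimal sketch of one possible fix (the answer does not show its exact fix): perform the squaring in a 64-bit signed type, which matches the int64_t distances the rest of the code already uses:

#include <cstdint>

// Widen before multiplying: the int argument converts to int64_t at the call site,
// so e.g. (-60000) * (-60000) is computed in 64 bits and cannot overflow.
inline std::int64_t sqr64(std::int64_t arg) { return arg * arg; }

// used as, e.g.:
//   if (sqr64(pts[i].z - midz) >= ans.distance_sq) { ... }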
I want to speed up this nested for loop. I have just started learning CUDA; how could I use CUDA to parallelize this C++ code?
#include <cstdlib>
#include <cmath>
#include <algorithm>
#define PI 3.14159265
using namespace std;
int main()
{
int nbint = 2;
int hits = 20;
int nbinp = 2;
float _theta, _phi, _l, _m, _n, _k = 0, delta = 5;
float x[20],y[20],z[20],a[20],t[20];
for (int i = 0; i < hits; ++i)
{
x[i] = rand() / (float)(RAND_MAX / 100);
}
for (int i = 0; i < hits; ++i)
{
y[i] = rand() / (float)(RAND_MAX / 100);
}
for (int i = 0; i < hits; ++i)
{
z[i] = rand() / (float)(RAND_MAX / 100);
}
for (int i = 0; i < hits; ++i)
{
a[i] = rand() / (float)(RAND_MAX / 100);
}
float maxforall = 1e-6;
float theta0;
float phi0;
for (int i = 0; i < nbint; i++)
{
_theta = (0.5 + i)*delta;
for (int j = 0; j < nbinp; j++)
{
_phi = (0.5 + j)*delta / _theta;
_l = sin(_theta* PI / 180.0)*cos(_phi* PI / 180.0);
_m = sin(_theta* PI / 180.0)*sin(_phi* PI / 180.0);
_n = cos(_theta* PI / 180.0);
for (int k = 0; k < hits; k++)
{
_k = -(_l*x[k] + _m*y[k] + _n*z[k]);
t[k] = a[k] - _k;
}
sort(t, t + hits); // std::sort in place of the non-standard qsort(t, 0, hits - 1) call
float max = t[0];
for (int k = 0; k < hits; k++)
{
if (max < t[k])
max = t[k];
}
if (max > maxforall)
{
maxforall = max;
}
}
}
return 0;
}
I want to make the innermost for loop and the sort part (maybe the whole nested loop) parallel. After sorting those arrays I find the maximum over all of them. I use the maximum here only to simplify the code. The reason I need the sort is that the values represent continuous time information (all arrays contain time information). The sort puts those times in order from lowest to highest. Then I compare against a specific time interval (not a single value). The comparison works much like picking the maximum, but with a continuous interval instead of a single value.
Your 3 nested loops calculate nbint*nbinp*hits values. Since these values are independent of each other, all of them can be calculated in parallel.
You stated in your comments that you have a commutative and associative "filter condition" which reduces the output to a single scalar value. This can be exploited to avoid sorting and storing the temporary values. Instead, we can calculate the values on the fly and then apply a parallel reduction to determine the end result.
This could be done in "raw" CUDA; below I implemented the idea using Thrust. The main idea is to run grid_op nbint*nbinp*hits times in parallel. To recover the three original "loop indices" from the single scalar index passed to grid_op, the algorithm linked in the code comment below is used.
thrust::transform_reduce performs the on-the-fly transformation and the subsequent parallel reduction (here thrust::maximum is used as a substitute for your actual filter condition).
#include <cmath>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/tuple.h>
// ### BEGIN utility for demo ####
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/random.h>
thrust::host_vector<float> random_vector(const size_t N)
{
thrust::default_random_engine rng;
thrust::uniform_real_distribution<float> u01(0.0f, 1.0f);
thrust::host_vector<float> temp(N);
for(size_t i = 0; i < N; i++) {
temp[i] = u01(rng);
}
return temp;
}
// ### END utility for demo ####
template <typename... Iterators>
thrust::zip_iterator<thrust::tuple<Iterators...>> zip(Iterators... its)
{
return thrust::make_zip_iterator(thrust::make_tuple(its...));
}
template <typename ZipIterator>
class grid_op
{
public:
grid_op(ZipIterator zipIt, std::size_t dim1, std::size_t dim2) : zipIt(zipIt), dim1(dim1), dim2(dim2){}
__host__ __device__
float operator()(std::size_t index) const
{
const auto coords = unflatten_3d_index(index, dim1, dim2);
const auto values = zipIt[thrust::get<2>(coords)];
const float delta = 5;
const float _theta = (0.5f + thrust::get<0>(coords))*delta;
const float _phi = (0.5f + thrust::get<1>(coords))*delta / _theta;
const float _l = sin(_theta* M_PI / 180.0)*cos(_phi* M_PI / 180.0);
const float _m = sin(_theta* M_PI / 180.0)*sin(_phi* M_PI / 180.0);
const float _n = cos(_theta* M_PI / 180.0);
const float _k = -(_l*thrust::get<0>(values) + _m*thrust::get<1>(values) + _n*thrust::get<2>(values));
return (thrust::get<3>(values) - _k);
}
private:
__host__ __device__
thrust::tuple<std::size_t, std::size_t, std::size_t>
unflatten_3d_index(std::size_t index, std::size_t dim1, std::size_t dim2) const
{
// taken from https://stackoverflow.com/questions/29142417/4d-position-from-1d-index
std::size_t x = index % dim1;
std::size_t y = ( ( index - x ) / dim1 ) % dim2;
std::size_t z = ( ( index - y * dim1 - x ) / (dim1 * dim2) );
return thrust::make_tuple(x,y,z);
}
ZipIterator zipIt;
std::size_t dim1;
std::size_t dim2;
};
template <typename ZipIterator>
grid_op<ZipIterator> make_grid_op(ZipIterator zipIt, std::size_t dim1, std::size_t dim2)
{
return grid_op<ZipIterator>(zipIt, dim1, dim2);
}
int main()
{
const int nbint = 3;
const int nbinp = 4;
const int hits = 20;
const std::size_t N = nbint * nbinp * hits;
thrust::device_vector<float> d_x = random_vector(hits);
thrust::device_vector<float> d_y = random_vector(hits);
thrust::device_vector<float> d_z = random_vector(hits);
thrust::device_vector<float> d_a = random_vector(hits);
auto zipIt = zip(d_x.begin(), d_y.begin(), d_z.begin(), d_a.begin());
auto countingIt = thrust::counting_iterator<std::size_t>(0);
auto unary_op = make_grid_op(zipIt, nbint, nbinp);
auto binary_op = thrust::maximum<float>();
const float init = 0;
float max = thrust::transform_reduce(
countingIt, countingIt+N,
unary_op,
init,
binary_op
);
std::cout << "max = " << max << std::endl;
}
I have a complex set of template functions which do calculations in a loop, combining floating-point numbers and uint32_t loop indices. I was surprised to observe that for this kind of function, my test code runs faster with double-precision floating-point numbers than with single-precision ones.
As a test, I changed the format of my indices to uint16_t. After this, both the double and the float version of the program were faster (as expected), but now the float version was significantly faster than the double version. I also tested the program with uint64_t indices. In this case the double and the float version are equally slow.
I imagine that this is because a uint32_t fits into the mantissa of a double but not of a float. Once the index type was reduced to uint16_t, the indices also fit into the mantissa of a float, and the conversion should be trivial. In the case of uint64_t, the conversion to double also needs rounding, which would explain why both versions perform equally.
Can anybody confirm this explanation?
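For reference, the mantissa widths behind this reasoning can be checked directly with a small standalone snippet (not part of the test program below):

#include <iostream>
#include <limits>

// float has a 24-bit significand, double a 53-bit one: a 32-bit index converts
// exactly to double but not (in general) to float, while a 16-bit index fits both.
int main()
{
    std::cout << "float mantissa bits:  " << std::numeric_limits<float>::digits  << '\n'  // 24
              << "double mantissa bits: " << std::numeric_limits<double>::digits << '\n'; // 53
}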
EDIT: Using int or long as the index type, the program runs as fast as with uint16_t. I guess this speaks against what I suspected first.
EDIT: I compiled the program for windows on an x86 architecture.
EDIT: Here is a piece of code that reproduces the effect of double being faster than float for uint32_t, and both cases being equally fast for int. Please do not comment on the usefulness of this code; it is a modified code fragment that reproduces the effect and does nothing sensible.
The main file:
#include "stdafx.h"
typedef short spectraType;
typedef int intermediateValue;
typedef double returnType;
#include "Preprocess_t.h"
#include "Windows.h"
#include <iostream>
int main()
{
const size_t numberOfBins = 10000;
const size_t numberOfSpectra = 500;
const size_t peakWidth = 25;
bool startPeak = false;
short peakHeight;
Preprocess<short, returnType> myPreprocessor;
std::vector<returnType> processedSpectrum;
std::vector<std::vector<short>> spectra(numberOfSpectra, std::vector<short>(numberOfBins));
std::vector<float> peakShape(peakWidth);
LARGE_INTEGER freq, start, stop;
double time_ms;
QueryPerformanceFrequency(&freq);
for (size_t i = 0; i < peakWidth; ++i)
{
peakShape[i] = static_cast<float>(exp(-(i - peakWidth / 2.0) *(i - peakWidth / 2.0) / 10.0));
}
for (size_t i = 0; i < numberOfSpectra; ++i)
{
size_t j = 0;
for (; j < 200; ++j)
{
spectra[i][j] = rand() % 100;
}
for (size_t k = 0; k < 25; ++k)
{
spectra[i][j] = static_cast<short>(16383 * peakShape[k]);
j++;
}
for (; j < numberOfBins; ++j)
{
startPeak = !static_cast<bool>(abs(rand()) % (numberOfBins / 4));
if (startPeak)
{
peakHeight = rand() % 16384;
for (size_t k = 0; k < 25 && j< numberOfBins; ++k)
{
spectra[i][j] = peakHeight * peakShape[k] + rand() % 100;
j++;
}
}
else
{
spectra[i][j] = rand() % 100;
}
}
for (j = 0; j < numberOfBins; ++j)
{
double temp = 1000.0*exp(-(static_cast<float>(j) / (numberOfBins / 3.0)))*sin(static_cast<float>(j) / (numberOfBins / 10.0));
spectra[i][j] -= static_cast<short>(1000.0*exp(-(static_cast<float>(j) / (numberOfBins / 3.0)))*sin(static_cast<float>(j) / (numberOfBins / 10.0)));
}
}
// This is where the critical code is called
QueryPerformanceCounter(&start);
for (int i = 0; i < numberOfSpectra; ++i)
{
myPreprocessor.SetSpectrum(&spectra[i], 1000, &processedSpectrum);
myPreprocessor.CorrectBaseline(30, 2.0);
}
QueryPerformanceCounter(&stop);
time_ms = static_cast<double>(stop.QuadPart - start.QuadPart) / static_cast<double>(freq.QuadPart);
std::cout << "time spend preprocessing: " << time_ms << std::endl;
std::cin.ignore();
return 0;
}
And the included header Preprocess_t.h:
#pragma once
#include <vector>
//typedef unsigned int indexType;
typedef unsigned short indexType;
template<typename T, typename Out_Type>
class Preprocess
{
public:
Preprocess() :threshold(1), sdev(1), laserPeakThreshold(500), a(0), b(0), firstPointUsedAfterLaserPeak(0) {};
~Preprocess() {};
void SetSpectrum(std::vector<T>* input, T laserPeakThreshold, std::vector<Out_Type>* processedSpectrum); ///#note We need the laserPeakThresholdParameter for the baseline correction, not onla for the shift.
void CorrectBaseline(indexType numberOfPoints, Out_Type thresholdFactor);
private:
void LinFitValues(indexType beginPoint);
Out_Type SumOfSquareDiffs(Out_Type x, indexType n);
Out_Type LinResidualSumOfSquareDist(indexType beginPoint);
std::vector<T>* input;
std::vector<Out_Type>* processedSpectrum;
std::vector<indexType> fitWave_X;
std::vector<Out_Type> fitWave;
Out_Type threshold;
Out_Type sdev;
T laserPeakThreshold;
Out_Type a, b;
indexType firstPointUsedAfterLaserPeak;
indexType numberOfPoints;
};
template<typename T, typename Out_Type>
void Preprocess<T, Out_Type>::CorrectBaseline(indexType numberOfPoints, Out_Type thresholdFactor)
{
this->numberOfPoints = numberOfPoints;
indexType numberOfBins = input->size();
indexType firstPointUsedAfterLaserPeak = 0;
indexType positionInFitWave = 0;
positionInFitWave = firstPointUsedAfterLaserPeak;
for (indexType i = firstPointUsedAfterLaserPeak; i < numberOfBins - numberOfPoints; i++) {
LinFitValues(positionInFitWave);
processedSpectrum->at(i + numberOfPoints) = input->at(i + numberOfPoints) - static_cast<Out_Type>(a + b*(i + numberOfPoints));
positionInFitWave++;
fitWave[positionInFitWave + numberOfPoints - 1] = input->at(i + numberOfPoints - 1);
fitWave_X[positionInFitWave + numberOfPoints - 1] = i + numberOfPoints - 1;
}
}
template<typename T, typename Out_Type>
void Preprocess<T, Out_Type>::LinFitValues(indexType beginPoint)
{
Out_Type y_mean, x_mean, SSxy, SSxx, normFactor;
y_mean = x_mean = SSxy = SSxx = normFactor = static_cast<Out_Type>(0);
indexType endPoint = beginPoint + numberOfPoints;
Out_Type temp;
if ((fitWave_X[endPoint - 1] - fitWave_X[beginPoint]) == numberOfPoints)
{
x_mean = (fitWave_X[endPoint - 1] - fitWave_X[beginPoint]) / static_cast<Out_Type>(2);
for (indexType i = beginPoint; i < endPoint; i++) {
y_mean += fitWave[i];
}
y_mean /= numberOfPoints;
SSxx = SumOfSquareDiffs(x_mean, fitWave_X[endPoint - 1]) - SumOfSquareDiffs(x_mean, fitWave_X[beginPoint]);
for (indexType i = beginPoint; i < endPoint; i++)
{
SSxy += (fitWave_X[i] - x_mean)*(fitWave[i] - y_mean);
}
}
else
{
for (indexType i = beginPoint; i < endPoint; i++) {
y_mean += fitWave[i];
x_mean += fitWave_X[i];
}
y_mean /= numberOfPoints;
x_mean /= numberOfPoints;
for (indexType i = beginPoint; i < endPoint; i++)
{
temp = (fitWave_X[i] - x_mean);
SSxy += temp*(fitWave[i] - y_mean);
SSxx += temp*temp;
}
}
b = SSxy / SSxx;
a = y_mean - b*x_mean;
}
template<typename T, typename Out_Type>
inline Out_Type Preprocess<T, Out_Type>::SumOfSquareDiffs(Out_Type x, indexType n)
{
return n*x*x + n*(n - 1)*x + ((n - 1)*n*(2 * n - 1)) / static_cast<Out_Type>(6);
}
template<typename T, typename Out_Type>
Out_Type Preprocess<T, Out_Type>::LinResidualSumOfSquareDist(indexType beginPoint)
{
Out_Type sumOfSquares = 0;
Out_Type temp;
for (indexType i = 0; i < numberOfPoints; ++i) {
temp = fitWave[i + beginPoint] - (a + b*fitWave_X[i + beginPoint]);
sumOfSquares += temp*temp;
}
return sumOfSquares;
}
template<typename T, typename Out_Type>
inline void Preprocess<T, Out_Type>::SetSpectrum(std::vector<T>* input, T laserPeakThreshold, std::vector<Out_Type>* processedSpectrum)
{
this->input = input;
fitWave_X.resize(input->size());
fitWave.resize(input->size());
this->laserPeakThreshold = laserPeakThreshold;
this->processedSpectrum = processedSpectrum;
processedSpectrum->resize(input->size());
}
Are you using MSVC? I had a similar effect when I implemented code that was essentially a matrix multiplication plus a vector addition. There, I thought floats would be faster because they can be SIMD-parallelized better, as more of them fit into the SSE registers. However, using doubles was much faster.
After some investigation, I figured out from the assembler code that the floats need conversion from the internal FPU precision, and this rounding was consuming most of the runtime. You can change the FP model to something faster at the cost of reduced precision. There is also some discussion of this in older threads here on SO.
Searching for the best algorithm, I found there is a tradeoff: implementation complexity and a big constant factor on the one hand, and asymptotic runtime complexity on the other. I chose an LU-decomposition-based algorithm because it is quite simple to implement and has good enough performance.
#include <valarray>
#include <vector>
#include <utility>
#include <cmath>
#include <cstddef>
#include <cassert>
template< typename value_type >
struct math
{
using size_type = std::size_t;
size_type const dimension_;
value_type const & eps;
value_type const zero = value_type(0);
value_type const one = value_type(1);
private :
using vector = std::valarray< value_type >;
using matrix = std::vector< vector >;
matrix matrix_;
matrix minor_;
public :
math(size_type const _dimension,
value_type const & _eps)
: dimension_(_dimension)
, eps(_eps)
, matrix_(dimension_)
, minor_(dimension_ - 1)
{
assert(1 < dimension_);
assert(!(eps < zero));
for (size_type r = 0; r < dimension_; ++r) {
matrix_[r].resize(dimension_);
}
size_type const minor_size = dimension_ - 1;
for (size_type r = 0; r < minor_size; ++r) {
minor_[r].resize(minor_size);
}
}
template< typename rhs = matrix >
void
operator = (rhs const & _matrix)
{
auto irow = std::begin(matrix_);
for (auto const & row_ : _matrix) {
auto icol = std::begin(*irow);
for (auto const & v : row_) {
*icol = v;
++icol;
}
++irow;
}
}
value_type
det(matrix & _matrix,
size_type const _dimension)
{ // calculates lower unit triangular matrix and upper triangular
assert(0 < _dimension);
value_type det_ = one;
for (size_type i = 0; i < _dimension; ++i) {
vector & ri_ = _matrix[i];
using std::abs;
value_type max_ = abs(ri_[i]);
size_type pivot = i;
{
size_type p = i;
while (++p < _dimension) {
value_type y_ = abs(_matrix[p][i]);
if (max_ < y_) {
max_ = std::move(y_);
pivot = p;
}
}
}
if (!(eps < max_)) { // regular?
return zero; // singular
}
if (pivot != i) {
det_ = -det_; // each permutation flips sign of det
ri_.swap(_matrix[pivot]);
}
value_type & dia_ = ri_[i];
det_ *= dia_; // det is multiple of diagonal elements
for (size_type j = 1 + i; j < _dimension; ++j) {
_matrix[j][i] /= dia_;
}
for (size_type a = 1 + i; a < _dimension; ++a) {
vector & a_ = minor_[a - 1];
value_type const & ai_ = _matrix[a][i];
for (size_type b = 1 + i; b < _dimension; ++b) {
a_[b - 1] = ai_ * ri_[b];
}
}
for (size_type a = 1 + i; a < _dimension; ++a) {
vector const & a_ = minor_[a - 1];
vector & ra_ = _matrix[a];
for (size_type b = 1 + i; b < _dimension; ++b) {
ra_[b] -= a_[b - 1];
}
}
}
return det_;
}
value_type
det(size_type const _dimension)
{
return det(matrix_, _dimension);
}
value_type
det()
{
return det(dimension_);
}
};
// main.cpp
#include <iostream>
#include <limits>
#include <cstdlib>
int
main()
{
using value_type = double;
value_type const eps = std::numeric_limits< value_type >::epsilon();
std::size_t const dimension_ = 3;
math< value_type > m(dimension_, eps);
m = { // example from https://en.wikipedia.org/wiki/Determinant#Laplace.27s_formula_and_the_adjugate_matrix
{-2.0, 2.0, -3.0},
{-1.0, 1.0, 3.0},
{ 2.0, 0.0, -1.0}
};
std::cout << m.det() << std::endl; // 18
return EXIT_SUCCESS;
}
The det() function is the hottest function in the algorithm that uses it. I am sure det() is not as fast as it can be, because runtime performance comparisons (using google-pprof) against a reference implementation of the whole algorithm show a disproportionate amount of time spent in det().
How can I improve the performance of the det() function? What are the evident optimizations to apply immediately? Should I change the indexing and memory access order, or something else? Container types? Prefetching?
The typical value of dimension_ is in the range of 3 to 10 (but it can be 100 if value_type is mpfr or something similar).
Isn't your (snippet from det())
for (size_type a = 1 + i; a < _dimension; ++a) {
vector & a_ = minor_[a - 1];
value_type const & ai_ = _matrix[a][i];
for (size_type b = 1 + i; b < _dimension; ++b) {
a_[b - 1] = ai_ * ri_[b];
}
}
for (size_type a = 1 + i; a < _dimension; ++a) {
vector const & a_ = minor_[a - 1];
vector & ra_ = _matrix[a];
for (size_type b = 1 + i; b < _dimension; ++b) {
ra_[b] -= a_[b - 1];
}
}
doing the same as
for (size_type a = 1 + i; a < _dimension; ++a) {
vector & ra_ = _matrix[a];
value_type ai_ = ra_[i];
for (size_type b = 1 + i; b < _dimension; ++b) {
ra_[b] -= ai_ * ri_[b];
}
}
without any need for minor_? Moreover, now the inner loop can easily be vectorised.
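As a follow-up sketch of that vectorisation point (assuming a compiler with OpenMP SIMD support; the helper name and signature are made up for illustration), the merged elimination step could be written with an explicit hint like this:

#include <cstddef>
#include <valarray>
#include <vector>

// Sketch of the merged update above with an explicit vectorization hint on the
// inner loop; 'm' plays the role of _matrix, 'i' is the current pivot column.
template <typename value_type>
void eliminate_below(std::vector<std::valarray<value_type>>& m,
                     std::size_t i, std::size_t dimension)
{
    std::valarray<value_type> const& ri_ = m[i];
    for (std::size_t a = 1 + i; a < dimension; ++a) {
        std::valarray<value_type>& ra_ = m[a];
        value_type const ai_ = ra_[i];   // already divided by the pivot
        #pragma omp simd
        for (std::size_t b = 1 + i; b < dimension; ++b)
            ra_[b] -= ai_ * ri_[b];
    }
}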