Write numpy einsum operation as Eigen tensors - C++
I want to write the following numpy einsum as an Eigen Tensor op:
import numpy as np
L = np.random.rand(2, 2, 136)
U = np.random.rand(2, 2, 136)
result = np.einsum('ijl,jkl->ikl', U, L)
I can write it with for loops in C++ like so:
for (int i = 0; i < 2; i++) {
    for (int j = 0; j < 2; j++) {
        for (int k = 0; k < 2; k++) {
            for (int l = 0; l < 136; l++) {
                result(i, k, l) += U(i, j, l) * L(j, k, l);
            }
        }
    }
}
How do I write this in Eigen notation using its operations? Using for loops doesn't allow Eigen to properly vectorize the operations, since I have complicated scalar types.
Edit:
As asked for: a Jet is an extension of dual numbers, where each element is a number followed by an array of gradients of that number with respect to some parameters.
http://ceres-solver.org/automatic_derivatives.html
A naive implementation might look like:
template<typename T, int N>
struct Jet
{
    T a;
    T v[N];
};
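For illustration, the product of two such Jets follows the usual dual-number product rule. A sketch (the operator and loop below are mine, not from Ceres):
template<typename T, int N>
Jet<T, N> operator*(const Jet<T, N>& f, const Jet<T, N>& g)
{
    Jet<T, N> r;
    r.a = f.a * g.a;                          // value part
    for (int i = 0; i < N; i++)
        r.v[i] = f.a * g.v[i] + f.v[i] * g.a; // gradient part (product rule)
    return r;
}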
If the Jet is written using Eigen ops, the idea is that, via expression templates, Eigen should vectorize all operations directly.
There is no contraction happening in the 3rd dimension "l" in your case. So, in a sense, L and U are arrays of length 136 of 2x2 matrices, and you are multiplying the matrix U[l] with L[l]. Doing something similar to np.einsum with Eigen is therefore, I think, not possible: Eigen::Tensor::contract only supports "real" contractions. One can of course always perform the loop over the 3rd dimension manually, but as shown below, this performs very badly.
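To make this view concrete: the slice-wise product can also be written with the core module's Eigen::Map (a sketch, assuming the question's column-major Tensor<double, 3> of shape (2, 2, 136); with custom scalar types like the Jets from the edit, the fixed Matrix2d type would not apply as-is):
Eigen::Tensor<double, 3> result(2, 2, 136);
for (int l = 0; l < 136; l++) {
    // 4 = 2*2 entries per slice in column-major order
    Eigen::Map<const Eigen::Matrix2d> Ul(U.data() + 4 * l);
    Eigen::Map<const Eigen::Matrix2d> Ll(L.data() + 4 * l);
    Eigen::Map<Eigen::Matrix2d> Rl(result.data() + 4 * l);
    Rl = Ul * Ll;
}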
Nevertheless, there are ways to speed things up and vectorize the loops, by either relying on automatic vectorization (which did not work well for me) or by giving additional compiler hints (via OpenMP SIMD).
In the following, I define cDim12 = 2 as the size of the first and second dimensions, and cDim3 = 136 as the size of the third dimension.
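In code, these are:
static constexpr int cDim12 = 2;
static constexpr int cDim3 = 136;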
For the timings, all code was compiled with -O3 -mavx with gcc 11.2 and clang 15.0.2. I used google benchmark to get the timings on an Intel Core i7-4770K (yeah, quite a few years old, sorry). Eigen trunk (08c961e83) from 20th January 2023 was used.
TL;DR: To summarize the results below:
Using the manual loop (with a better iteration order), auto-vectorization via an OpenMP SIMD pragma, and raw data access, gcc was the fastest ("DirectAccessWithOMP"): 2.8x faster than your straightforward loop with AVX. I guess this comes close to being "optimal" (cf. godbolt).
I couldn't get clang to vectorize the loop properly. Since you mentioned that either gcc or clang is fine, you seem to have the choice and I'd stick with gcc.
Python appears to be an order of magnitude or more slower compared to the fastest gcc result.
Note: Measure your real-world application, since things might behave completely differently there!
Code from the original post as baseline ("FromOriginalPost")
The straightforward code from your original post looks like this and is used as the baseline.
Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
result.setZero();
for (int i = 0; i < cDim12; i++) {
    for (int j = 0; j < cDim12; j++) {
        for (int k = 0; k < cDim12; k++) {
            for (int l = 0; l < cDim3; l++) {
                result(i, k, l) += U(i, j, l) * L(j, k, l);
            }
        }
    }
}
Optimized loop order ("OptimizedOrder")
Note that Eigen::Tensor uses column-major order by default (and row-major is not recommended). Thus, in an expression such as U(i, j, l), i should be the fastest (innermost) loop and l the slowest (outermost) loop. Reordering as best as I could:
for (int l = 0; l < cDim3; l++) {
    for (int j = 0; j < cDim12; j++) {
        for (int k = 0; k < cDim12; k++) {
            for (int i = 0; i < cDim12; i++) {
                result(i, k, l) += U(i, j, l) * L(j, k, l);
            }
        }
    }
}
This is 1.3x-1.4x faster.
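The reason: in column-major layout, the element (i, k, l) lives at the linear offset i + cDim12*(k + cDim12*l), so making i the innermost loop yields unit-stride, contiguous memory accesses.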
Using Eigen::Tensor::chip and contract ("EigenChipAndContract")
Using Eigen features as much as possible, I came up with the following:
Eigen::array<Eigen::IndexPair<int>, 1> productDims = {Eigen::IndexPair<int>(1, 0)};
Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
for (int l = 0; l < cDim3; l++) {
    result.chip(l, 2) = U.chip(l, 2).contract(L.chip(l, 2), productDims);
}
This performs very badly: it is 18x slower on gcc and 24x slower on clang when compared to "FromOriginalPost", presumably because the chip/contract machinery has far too much overhead for such tiny 2x2 matrices.
Using Eigen::TensorMap and contract ("EigenMapAndContract")
The "EigenChipAndContract" might do a lot of copying, so another idea was to use Eigen::TensorMap to get "references" to each necessary "slice" of data. For the raw array access, note again that Eigen uses column-major order.
Eigen::array<Eigen::IndexPair<int>, 1> productDims = {Eigen::IndexPair<int>(1, 0)};
Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
for (int l = 0; l < cDim3; l++) {
    Eigen::TensorMap<Eigen::Tensor<double, 2>> U_chip(U.data() + l * cDim12 * cDim12, cDim12, cDim12);
    Eigen::TensorMap<Eigen::Tensor<double, 2>> L_chip(L.data() + l * cDim12 * cDim12, cDim12, cDim12);
    Eigen::TensorMap<Eigen::Tensor<double, 2>> result_chip(result.data() + l * cDim12 * cDim12, cDim12, cDim12);
    result_chip = U_chip.contract(L_chip, productDims);
}
This is actually somewhat faster than "EigenChipAndContract", but still very slow. Compared to "FromOriginalPost", this is 14x slower for gcc and 19x slower for clang.
Vectorization with OpenMP ("EigenAccessWithOMP")
Although both gcc and clang can do automatic vectorization, without additional hints they do not yield good results here. However, both support the OpenMP pragma #pragma omp simd collapse(4) when compiled with -fopenmp:
#pragma omp simd collapse(4)
for (int l = 0; l < cDim3; l++) {
    for (int j = 0; j < cDim12; j++) {
        for (int k = 0; k < cDim12; k++) {
            for (int i = 0; i < cDim12; i++) {
                result(i, k, l) += U(i, j, l) * L(j, k, l);
            }
        }
    }
}
Compilation with -O3 -mavx -fopenmp results in a 2.5x faster runtime compared to the original code ("FromOriginalPost") for gcc. For clang, however, the code is 2.5x slower. Searching for clang + OpenMP-SIMD issues, clang apparently does have trouble sometimes (e.g. in this post). Checking the result on godbolt, clang indeed produces quite lengthy assembly.
Vectorization with OpenMP + direct raw access ("DirectAccessWithOMP")
The previous code used Eigen::Tensor::operator(), which should inline down to raw array accesses. However, remembering the column-major layout, we can also access the underlying array directly and check whether this improves anything. It also allows us to hint to the compiler once more that the data is properly aligned (although Eigen already declares it as such).
double * pR = result.data();
double * pU = U.data();
double * pL = L.data();
#pragma omp simd collapse(4) aligned(pR, pU, pL: 32) // 32: For AVX
for (int l = 0; l < cDim3; l++) {
    for (int j = 0; j < cDim12; j++) {
        for (int k = 0; k < cDim12; k++) {
            for (int i = 0; i < cDim12; i++) {
                pR[i + cDim12*(k + cDim12*l)] += pU[i + cDim12*(j + cDim12*l)] * pL[j + cDim12*(k + cDim12*l)];
            }
        }
    }
}
Somewhat surprisingly, this is 1.1x faster for gcc and 1.4x faster for clang when compared with "EigenAccessWithOMP".
When compared with the original "FromOriginalPost", it is 2.8x faster for gcc and 2.5x slower for clang.
When viewed on godbolt, gcc really produces quite concise assembly.
Python
I am not sure how far-fetched it is to compare the absolute execution time of a call to np.einsum with the C++ versions, since Python needs to do additional string parsing etc. Nevertheless, here is the code:
import numpy as np
import timeit
L = np.random.rand(2, 2, 136)
U = np.random.rand(2, 2, 136)
numIterations = 1000000
timing = timeit.timeit(lambda: np.einsum('ijl,jkl->ikl', U, L), number=numIterations)
print(f"np.einsum (per iteration): {timing.real/(numIterations*1e-9)}ns")
For Python 3.9 and numpy 1.24.1, this is roughly 6x slower compared to "FromOriginalPost" and 16x slower compared to "DirectAccessWithOMP" for gcc.
Raw timings
For gcc:
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
FromOriginalPost            823 ns          823 ns      3397793
OptimizedOrder              573 ns          573 ns      4895246
DirectAccess               1306 ns         1306 ns      2142826
EigenAccessWithOMP          324 ns          324 ns      8655549
DirectAccessWithOMP         296 ns          296 ns      9418635
EigenChipAndContract      14405 ns        14405 ns       193548
EigenMapAndContract       11390 ns        11390 ns       243122
For clang:
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
FromOriginalPost            753 ns          753 ns      3714543
OptimizedOrder              570 ns          570 ns      4921914
DirectAccess                569 ns          569 ns      4929755
EigenAccessWithOMP         2704 ns         2704 ns      1037819
DirectAccessWithOMP        1908 ns         1908 ns      1466390
EigenChipAndContract      17713 ns        17713 ns       157427
EigenMapAndContract       14064 ns        14064 ns       198875
Python:
np.einsum (per iteration): 4873.6035999991145 ns
Full code
The full code is also on godbolt; however, it is not really useful there since the compiler quite often times out.
Locally I compiled with -O3 -DNDEBUG -std=c++17 -mavx -fopenmp -Wall -Wextra.
#include <iostream>
#include <iomanip>
#include <cmath>
#include <unsupported/Eigen/CXX11/Tensor>
#include <benchmark/benchmark.h>

//====================================================
// Globals
//====================================================
static constexpr int cDim12 = 2;
static constexpr int cDim3 = 136;

Eigen::Tensor<double, 3> CreateRandomTensor()
{
    Eigen::Tensor<double, 3> m(cDim12, cDim12, cDim3);
    m.setRandom();
    return m;
}

Eigen::Tensor<double, 3> const L = CreateRandomTensor();
Eigen::Tensor<double, 3> const U = CreateRandomTensor();

//====================================================
// Helpers
//====================================================
Eigen::Tensor<double, 3> ReferenceResult()
{
    Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
    result.setZero();
    for (int i = 0; i < cDim12; i++) {
        for (int j = 0; j < cDim12; j++) {
            for (int k = 0; k < cDim12; k++) {
                for (int l = 0; l < cDim3; l++) {
                    result(i, k, l) += U(i, j, l) * L(j, k, l);
                }
            }
        }
    }
    return result;
}

void CheckResult(Eigen::Tensor<double, 3> const & result)
{
    Eigen::Tensor<double, 3> const ref = ReferenceResult();
    Eigen::Tensor<double, 3> const diff = ref - result;
    Eigen::Tensor<double, 0> const max = diff.maximum();
    Eigen::Tensor<double, 0> const min = diff.minimum();
    double const maxDiff = std::max(std::abs(max(0)), std::abs(min(0)));
    if (maxDiff > 1e-14) {
        std::cerr << "ERROR! Max Diff = " << std::setprecision(17) << maxDiff << std::endl;
    }
}

//====================================================
// Benchmarks
//====================================================
static void FromOriginalPost(benchmark::State& state)
{
    Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
    for (auto _ : state) {
        result.setZero();
        for (int i = 0; i < cDim12; i++) {
            for (int j = 0; j < cDim12; j++) {
                for (int k = 0; k < cDim12; k++) {
                    for (int l = 0; l < cDim3; l++) {
                        result(i, k, l) += U(i, j, l) * L(j, k, l);
                    }
                }
            }
        }
        benchmark::DoNotOptimize(result.data());
    }
    CheckResult(result);
}
BENCHMARK(FromOriginalPost);

static void OptimizedOrder(benchmark::State& state)
{
    Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
    for (auto _ : state) {
        result.setZero();
        for (int l = 0; l < cDim3; l++) {
            for (int j = 0; j < cDim12; j++) {
                for (int k = 0; k < cDim12; k++) {
                    for (int i = 0; i < cDim12; i++) {
                        result(i, k, l) += U(i, j, l) * L(j, k, l);
                    }
                }
            }
        }
        benchmark::DoNotOptimize(result.data());
    }
    CheckResult(result);
}
BENCHMARK(OptimizedOrder);

static void DirectAccess(benchmark::State& state)
{
    Eigen::Tensor<double, 3> U = ::U;
    Eigen::Tensor<double, 3> L = ::L;
    Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
    for (auto _ : state) {
        result.setZero();
        double * pR = result.data();
        double * pU = U.data();
        double * pL = L.data();
        for (int l = 0; l < cDim3; l++) {
            for (int j = 0; j < cDim12; j++) {
                for (int k = 0; k < cDim12; k++) {
                    for (int i = 0; i < cDim12; i++) {
                        pR[i + cDim12*(k + cDim12*l)] += pU[i + cDim12*(j + cDim12*l)] * pL[j + cDim12*(k + cDim12*l)];
                    }
                }
            }
        }
        benchmark::DoNotOptimize(result.data());
    }
    CheckResult(result);
}
BENCHMARK(DirectAccess);

static void EigenAccessWithOMP(benchmark::State& state)
{
    Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
    for (auto _ : state) {
        result.setZero();
        #pragma omp simd collapse(4)
        for (int l = 0; l < cDim3; l++) {
            for (int j = 0; j < cDim12; j++) {
                for (int k = 0; k < cDim12; k++) {
                    for (int i = 0; i < cDim12; i++) {
                        result(i, k, l) += U(i, j, l) * L(j, k, l);
                    }
                }
            }
        }
        benchmark::DoNotOptimize(result.data());
    }
    CheckResult(result);
}
BENCHMARK(EigenAccessWithOMP);

static void DirectAccessWithOMP(benchmark::State& state)
{
    Eigen::Tensor<double, 3> U = ::U;
    Eigen::Tensor<double, 3> L = ::L;
    Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
    for (auto _ : state) {
        result.setZero();
        double * pR = result.data();
        double * pU = U.data();
        double * pL = L.data();
        #pragma omp simd collapse(4) aligned(pR, pU, pL: 32) // 32: For AVX
        for (int l = 0; l < cDim3; l++) {
            for (int j = 0; j < cDim12; j++) {
                for (int k = 0; k < cDim12; k++) {
                    for (int i = 0; i < cDim12; i++) {
                        pR[i + cDim12*(k + cDim12*l)] += pU[i + cDim12*(j + cDim12*l)] * pL[j + cDim12*(k + cDim12*l)];
                    }
                }
            }
        }
        benchmark::DoNotOptimize(result.data());
    }
    CheckResult(result);
}
BENCHMARK(DirectAccessWithOMP);

static void EigenChipAndContract(benchmark::State& state)
{
    Eigen::array<Eigen::IndexPair<int>, 1> productDims = {Eigen::IndexPair<int>(1, 0)};
    Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
    for (auto _ : state) {
        result.setZero();
        for (int l = 0; l < cDim3; l++) {
            result.chip(l, 2) = U.chip(l, 2).contract(L.chip(l, 2), productDims);
        }
        benchmark::DoNotOptimize(result.data());
    }
    CheckResult(result);
}
BENCHMARK(EigenChipAndContract);

static void EigenMapAndContract(benchmark::State& state)
{
    Eigen::Tensor<double, 3> U = ::U;
    Eigen::Tensor<double, 3> L = ::L;
    Eigen::array<Eigen::IndexPair<int>, 1> productDims = {Eigen::IndexPair<int>(1, 0)};
    Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
    for (auto _ : state) {
        result.setZero();
        for (int l = 0; l < cDim3; l++) {
            Eigen::TensorMap<Eigen::Tensor<double, 2>> U_chip(U.data() + l * cDim12 * cDim12, cDim12, cDim12);
            Eigen::TensorMap<Eigen::Tensor<double, 2>> L_chip(L.data() + l * cDim12 * cDim12, cDim12, cDim12);
            Eigen::TensorMap<Eigen::Tensor<double, 2>> result_chip(result.data() + l * cDim12 * cDim12, cDim12, cDim12);
            result_chip = U_chip.contract(L_chip, productDims);
        }
        benchmark::DoNotOptimize(result.data());
    }
    CheckResult(result);
}
BENCHMARK(EigenMapAndContract);

BENCHMARK_MAIN();
EDIT for jets
After the original post was edited, the arithmetic types used are not really built-ins but rather Jets. Eigen can be extended to support custom types (as briefly outlined here). However, the Eigen::Tensor::contract() function still does not "magically" support the equivalent of np.einsum('ijl,jkl->ikl', U, L), since the last dimension l does not really perform a contraction. Of course, one could write such an operation, but this seems far from trivial.
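For reference, such an extension essentially amounts to specializing Eigen::NumTraits for the custom type. A minimal, incomplete sketch for the Jet type defined below (the cost constants are rough placeholders; a production version, such as the one in Ceres, needs additional members):
namespace Eigen {
template<int N>
struct NumTraits<Jet<N>> : NumTraits<double> {
    typedef Jet<N> Real;
    typedef Jet<N> NonInteger;
    typedef Jet<N> Nested;
    enum {
        IsComplex = 0,
        IsInteger = 0,
        IsSigned = 1,
        RequireInitialization = 1,
        ReadCost = 1,  // placeholder cost estimates
        AddCost = 3,
        MulCost = 3
    };
};
}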
If the only required contraction-like operation is the one from the original post, and the tensors are not further multiplied/added/etc., the simplest thing to do is to implement the single loop manually and play around with compilers, compiler settings, pragmas, etc. to figure out the best performance.
Jet type (adapted from here):
template<int N> struct Jet {
    double a = 0.0;
    Eigen::Matrix<double, 1, N> v = Eigen::Matrix<double, 1, N>::Zero();
};

template<int N>
EIGEN_STRONG_INLINE Jet<N> operator+(const Jet<N>& f, const Jet<N>& g) {
    return Jet<N>{f.a + g.a, f.v + g.v};
}

template<int N>
EIGEN_STRONG_INLINE Jet<N> operator*(const Jet<N>& f, const Jet<N>& g) {
    return Jet<N>{f.a * g.a, f.a * g.v + f.v * g.a};
}
For example (column-major):
Eigen::Tensor<Jet<N>, 3> L = CreateRandomTensor<Eigen::ColMajor>();
Eigen::Tensor<Jet<N>, 3> U = CreateRandomTensor<Eigen::ColMajor>();
Eigen::Tensor<Jet<N>, 3> result(cDim12, cDim12, cDim3);
SetToZero(result);
for (int l = 0; l < cDim3; l++) {
    for (int j = 0; j < cDim12; j++) {
        for (int k = 0; k < cDim12; k++) {
            for (int i = 0; i < cDim12; i++) {
                Jet<N> & r = result(i, k, l);
                r = r + U(i, j, l) * L(j, k, l);
            }
        }
    }
}
or with row-major order:
Eigen::Tensor<Jet<N>, 3, Eigen::RowMajor> L = CreateRandomTensor<Eigen::RowMajor>();
Eigen::Tensor<Jet<N>, 3, Eigen::RowMajor> U = CreateRandomTensor<Eigen::RowMajor>();
Eigen::Tensor<Jet<N>, 3, Eigen::RowMajor> result(cDim12, cDim12, cDim3);
SetToZero(result);
for (int i = 0; i < cDim12; i++) {
    for (int k = 0; k < cDim12; k++) {
        for (int j = 0; j < cDim12; j++) {
            for (int l = 0; l < cDim3; l++) {
                Jet<N> & r = result(i, k, l);
                r = r + U(i, j, l) * L(j, k, l);
            }
        }
    }
}
gcc and clang yield the same performance. They auto-vectorize the column-major loops, but apparently not the row-major ones. Direct access of the underlying data does not improve things. Moreover, adding #pragma omp simd collapse(4) results in worse performance in both cases (clang also warns that the loops could not be vectorized); I guess the explicit SIMD code that Eigen uses internally for the operations on Jet::v is the reason.
As an additional note: the Eigen documentation says that you shouldn't really combine row-major order with Eigen::Tensor:
The tensor library supports 2 layouts: ColMajor (the default) and RowMajor. Only the default column major layout is currently fully supported, and it is therefore not recommended to attempt to use the row major layout at the moment.
Full code:
#include <iostream>
#include <iomanip>
#include <cmath>
#include <unsupported/Eigen/CXX11/Tensor>
#include <benchmark/benchmark.h>

static constexpr int cDim12 = 2;
static constexpr int cDim3 = 136;

template<int N> struct Jet {
    double a = 0.0;
    Eigen::Matrix<double, 1, N> v = Eigen::Matrix<double, 1, N>::Zero();
};

template<int N>
EIGEN_STRONG_INLINE Jet<N> operator+(const Jet<N>& f, const Jet<N>& g) {
    return Jet<N>{f.a + g.a, f.v + g.v};
}

template<int N>
EIGEN_STRONG_INLINE Jet<N> operator-(const Jet<N>& f, const Jet<N>& g) {
    return Jet<N>{f.a - g.a, f.v - g.v};
}

template<int N>
EIGEN_STRONG_INLINE Jet<N> operator*(const Jet<N>& f, const Jet<N>& g) {
    return Jet<N>{f.a * g.a, f.a * g.v + f.v * g.a};
}

template<int N>
EIGEN_STRONG_INLINE Jet<N> operator/(const Jet<N>& f, const Jet<N>& g) {
    return Jet<N>{f.a / g.a, f.v / g.a - f.a * g.v / (g.a * g.a)};
}

static constexpr int N = 8;

template <Eigen::StorageOptions storage>
auto CreateRandomTensor()
{
    Eigen::Tensor<Jet<N>, 3, storage> result(cDim12, cDim12, cDim3);
    for (int l = 0; l < cDim3; l++) {
        for (int k = 0; k < cDim12; k++) {
            for (int i = 0; i < cDim12; i++) {
                Jet<N> jet;
                jet.a = (double)rand() / RAND_MAX;
                jet.v.setRandom();
                result(i, k, l) = jet;
            }
        }
    }
    return result;
}

template <class T>
void SetToZero(T & result)
{
    for (int l = 0; l < cDim3; l++) {
        for (int k = 0; k < cDim12; k++) {
            for (int i = 0; i < cDim12; i++) {
                result(i, k, l) = Jet<N>{};
            }
        }
    }
}

static void EigenAccessNoOMP(benchmark::State& state)
{
    srand(42);
    Eigen::Tensor<Jet<N>, 3> L = CreateRandomTensor<Eigen::ColMajor>();
    Eigen::Tensor<Jet<N>, 3> U = CreateRandomTensor<Eigen::ColMajor>();
    Eigen::Tensor<Jet<N>, 3> result(cDim12, cDim12, cDim3);
    for (auto _ : state) {
        SetToZero(result);
        for (int l = 0; l < cDim3; l++) {
            for (int j = 0; j < cDim12; j++) {
                for (int k = 0; k < cDim12; k++) {
                    for (int i = 0; i < cDim12; i++) {
                        Jet<N> & r = result(i, k, l);
                        r = r + U(i, j, l) * L(j, k, l);
                    }
                }
            }
        }
        benchmark::DoNotOptimize(result.data());
    }
}
BENCHMARK(EigenAccessNoOMP);

static void EigenAccessNoOMPRowMajor(benchmark::State& state)
{
    srand(42);
    Eigen::Tensor<Jet<N>, 3, Eigen::RowMajor> L = CreateRandomTensor<Eigen::RowMajor>();
    Eigen::Tensor<Jet<N>, 3, Eigen::RowMajor> U = CreateRandomTensor<Eigen::RowMajor>();
    Eigen::Tensor<Jet<N>, 3, Eigen::RowMajor> result(cDim12, cDim12, cDim3);
    for (auto _ : state) {
        SetToZero(result);
        for (int i = 0; i < cDim12; i++) {
            for (int k = 0; k < cDim12; k++) {
                for (int j = 0; j < cDim12; j++) {
                    for (int l = 0; l < cDim3; l++) {
                        Jet<N> & r = result(i, k, l);
                        r = r + U(i, j, l) * L(j, k, l);
                    }
                }
            }
        }
        benchmark::DoNotOptimize(result.data());
    }
}
BENCHMARK(EigenAccessNoOMPRowMajor);

static void DirectAccessNoOMP(benchmark::State& state)
{
    srand(42);
    Eigen::Tensor<Jet<N>, 3> L = CreateRandomTensor<Eigen::ColMajor>();
    Eigen::Tensor<Jet<N>, 3> U = CreateRandomTensor<Eigen::ColMajor>();
    Eigen::Tensor<Jet<N>, 3> result(cDim12, cDim12, cDim3);
    for (auto _ : state) {
        SetToZero(result);
        Jet<N> * pR = result.data();
        Jet<N> * pU = U.data();
        Jet<N> * pL = L.data();
        for (int l = 0; l < cDim3; l++) {
            for (int j = 0; j < cDim12; j++) {
                for (int k = 0; k < cDim12; k++) {
                    for (int i = 0; i < cDim12; i++) {
                        Jet<N> & r = pR[i + cDim12*(k + cDim12*l)];
                        r = r + pU[i + cDim12*(j + cDim12*l)] * pL[j + cDim12*(k + cDim12*l)];
                    }
                }
            }
        }
        benchmark::DoNotOptimize(result.data());
    }
}
BENCHMARK(DirectAccessNoOMP);

static void EigenAccessWithOMP(benchmark::State& state)
{
    srand(42);
    Eigen::Tensor<Jet<N>, 3> L = CreateRandomTensor<Eigen::ColMajor>();
    Eigen::Tensor<Jet<N>, 3> U = CreateRandomTensor<Eigen::ColMajor>();
    Eigen::Tensor<Jet<N>, 3> result(cDim12, cDim12, cDim3);
    for (auto _ : state) {
        SetToZero(result);
        #pragma omp simd collapse(4)
        for (int l = 0; l < cDim3; l++) {
            for (int j = 0; j < cDim12; j++) {
                for (int k = 0; k < cDim12; k++) {
                    for (int i = 0; i < cDim12; i++) {
                        Jet<N> & r = result(i, k, l);
                        r = r + U(i, j, l) * L(j, k, l);
                    }
                }
            }
        }
        benchmark::DoNotOptimize(result.data());
    }
}
BENCHMARK(EigenAccessWithOMP);

static void DirectAccessWithOMP(benchmark::State& state)
{
    srand(42);
    Eigen::Tensor<Jet<N>, 3> L = CreateRandomTensor<Eigen::ColMajor>();
    Eigen::Tensor<Jet<N>, 3> U = CreateRandomTensor<Eigen::ColMajor>();
    Eigen::Tensor<Jet<N>, 3> result(cDim12, cDim12, cDim3);
    for (auto _ : state) {
        SetToZero(result);
        Jet<N> * pR = result.data();
        Jet<N> * pU = U.data();
        Jet<N> * pL = L.data();
        #pragma omp simd collapse(4) aligned(pR, pU, pL: 32)
        for (int l = 0; l < cDim3; l++) {
            for (int j = 0; j < cDim12; j++) {
                for (int k = 0; k < cDim12; k++) {
                    for (int i = 0; i < cDim12; i++) {
                        Jet<N> & r = pR[i + cDim12*(k + cDim12*l)];
                        r = r + pU[i + cDim12*(j + cDim12*l)] * pL[j + cDim12*(k + cDim12*l)];
                    }
                }
            }
        }
        benchmark::DoNotOptimize(result.data());
    }
}
BENCHMARK(DirectAccessWithOMP);

BENCHMARK_MAIN();
Minimum Working Example
Here's a working example. See godbolt.org to run the code.
#include <Eigen/Dense>
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
    // Setup tensors
    Eigen::Tensor<double, 3> U(2, 2, 136);
    Eigen::Tensor<double, 3> L(2, 2, 136);

    // Fill with random vars
    U.setRandom();
    L.setRandom();

    // Create a vector of dimension pairs you want to contract over.
    // Since j is the second index of the first tensor (U), we specify index 1;
    // since j is the first index of the second tensor (L), we specify index 0.
    Eigen::array<Eigen::IndexPair<int>, 1> contraction_dims = {Eigen::IndexPair<int>(1, 0)};

    // Perform contraction and save the result. Contracting two rank-3 tensors
    // over one index pair yields a rank-4 tensor.
    Eigen::Tensor<double, 4> result = U.contract(L, contraction_dims);
}
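Note that this is a full contraction over j: result(i, l1, k, l2) = sum over j of U(i, j, l1) * L(j, k, l2). The einsum from the question instead keeps the third index l shared between both operands, which, as discussed in the other answer, contract cannot express directly.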
Vectorization
Vectorization is a tricky thing. You'll likely want to compile the code with -O3 -fopt-info-vec-missed, where -fopt-info-vec-missed prints very detailed information about which vectorizations were missed. If you really want further insight into why your compiler is or isn't optimizing things the way you hoped, check out tools like optview2 and this great talk from CppCon by Ofek Shilon.
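For instance, a gcc invocation might look like this (the file name is a placeholder):
g++ -O3 -mavx -fopt-info-vec-missed main.cpp
Hope this helps.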