Julia vs C++ performance: almost a factor of 30

The Julia program below takes about 6 seconds on my laptop (second test(n)). An equivalent C++ program (using Eigen) takes only 0.19 s. According to the results I have seen on https://programming-language-benchmarks.vercel.app/cpp, I expected a much smaller difference. What is wrong with my Julia program? I would appreciate any hints on how to improve it.
using StaticArrays
using Printf
struct CoordinateTransformation
b1::SVector{3,Float64}
b2::SVector{3,Float64}
b3::SVector{3,Float64}
r0::SVector{3,Float64}
mf::SMatrix{3,3,Float64}
mb::SMatrix{3,3,Float64}
end
function dot(a::SVector{3,Float64}, b::SVector{3,Float64})
a[1]*b[1] + a[2]*b[2] + a[3]*b[3]
end
function CoordinateTransformation(b1::SVector{3,Float64}, b2::SVector{3,Float64}, b3::SVector{3,Float64}, r0::SVector{3,Float64})
mf = MMatrix{3,3,Float64}(undef)
e1::SVector{3, Float64} = [1.0, 0.0, 0.0]
e2::SVector{3, Float64} = [0.0, 1.0, 0.0]
e3::SVector{3, Float64} = [0.0, 0.0, 1.0]
mf[1, 1] = dot(b1, e1);
mf[1, 2] = dot(b1, e2);
mf[1, 3] = dot(b1, e3);
mf[2, 1] = dot(b2, e1);
mf[2, 2] = dot(b2, e2);
mf[2, 3] = dot(b2, e3);
mf[3, 1] = dot(b3, e1);
mf[3, 2] = dot(b3, e2);
mf[3, 3] = dot(b3, e3);
mb = inv(mf)
CoordinateTransformation(b1, b2, b3, r0, mf, mb)
end
@inline function transform_point_f(at::CoordinateTransformation, v::MVector{3,Float64})
at.mf * v + at.r0
end
@inline function transform_point_b(at::CoordinateTransformation, v::MVector{3,Float64})
at.mb * (v - at.r0)
end
@inline function transform_vector_f(at::CoordinateTransformation, v::MVector{3,Float64})
at.mf * v
end
@inline function transform_vector_b(at::CoordinateTransformation, v::MVector{3,Float64})
at.mb * v
end
function test(n)
theta = 1.0;
c = cos(1.0);
s = sin(1.0);
b1::SVector{3, Float64} = [c, 0.0, s]
b2::SVector{3, Float64} = [0.0, 1.0, 0.0]
b3::SVector{3, Float64} = [-s, 0.0, c]
r0::SVector{3, Float64} = [0.0, 0.0, 1.0]
at::CoordinateTransformation = CoordinateTransformation(b1, b2, b3, r0)
@printf("%e\n", n)
points = Array{MVector{3, Float64}, 1}(undef, n)
@inbounds for i in 1:n
points[i] = [1.0, 0.0, 0.0]
end
@inbounds for i in 1:n
points[i] = transform_point_f(at, points[i])
end
println(points[n])
@inbounds for i in 1:n
points[i] = transform_point_b(at, points[i])
end
println(points[n])
end
n = 10000000
@timev test(n)
@timev test(n)

A major issue with your test function is that a massive number of MVectors are allocated in the 3 loops. In addition, since MVectors are mutable structs, which are reference types, the points vector is a vector of references, which is not great for performance.
Instead, I recommend changing points to a vector of SVectors and modifying the code to accommodate this (e.g., replace every MVector with SVector). In the first loop, points[i] = [1.0, 0.0, 0.0] should be changed to points[i] = SA[1.0, 0.0, 0.0] to avoid allocations from creating temporary vectors. (See also Eric's comment on this.)
Implementing these simple changes, I see an improvement from
2.523284 seconds (40.00 M allocations: 1.714 GiB, 43.11% gc time)
to
0.171544 seconds (267 allocations: 228.891 MiB)


Eigen spline interpolation: zero derivatives at ends

I need to interpolate a tabled function s.t. the resulting spline has zero derivatives at ends of interval. I wrote the example using InterpolateWithDerivatives function, but the resulting spline doesn't cross the given points:
typedef Eigen::Spline<double,1> Spline1d;
typedef Eigen::SplineFitting<Spline1d> Spline1dFitting;
void test_spline()
{
Eigen::VectorXd x(5);
Eigen::VectorXd y(5);
x << 0.0, 0.25, 0.5, 0.75, 1.0;
y << 0.0, 0.5, 1.0, 0.5, 0.0;
Eigen::VectorXd derivatives(2);
derivatives << 0., 0.;
Eigen::VectorXi indices(2);
indices << 0, x.size() - 1;
Spline1d const& spline = Spline1dFitting::InterpolateWithDerivatives(
y.transpose(), derivatives.transpose(), indices, 3, x);
for (int i = 0; i < 5; ++ i)
std::cout << "must be 0: " << spline(x(i)) - y(i) << std::endl;
}
While without fixing derivatives it works well:
void test_spline_2()
{
Eigen::VectorXd x(5);
Eigen::VectorXd y(5);
x << 0.0, 0.25, 0.5, 0.75, 1.0;
y << 0.0, 0.5, 1.0, 0.5, 0.0;
Spline1d const& spline2 = Spline1dFitting::Interpolate(y.transpose(), 3, x);
for (int i = 0; i < 5; ++ i)
std::cout << "must be 0: " << spline2(x(i)) - y(i) << std::endl;
}
Is something wrong here?
I came across the same problem yesterday. Unfortunately there is indeed a bug in Eigen. As pointed out by Andreas, the vector b was not initialized properly.
As I do not have time to track the bug in Eigen, I am posting my patch here in case it helps someone with the same issue.
--- /original/eigen3/unsupported/Eigen/src/Splines/SplineFitting.h 2018-09-24 10:13:26.281178488 +0200
+++ /new/eigen3/unsupported/Eigen/src/Splines/SplineFitting.h 2018-09-26 14:59:13.737373531 +0200
@@ -381,11 +381,12 @@
DenseIndex row = startRow;
DenseIndex derivativeIndex = derivativeStart;
+
for (DenseIndex i = 1; i < parameters.size() - 1; ++i)
{
const DenseIndex span = SplineType::Span(parameters[i], degree, knots);
- if (derivativeIndices[derivativeIndex] == i)
+ if (derivativeIndex < derivativeIndices.size() && derivativeIndices[derivativeIndex] == i)
{
A.block(row, span - degree, 2, degree + 1)
= SplineType::BasisFunctionDerivatives(parameters[i], 1, degree, knots);
@@ -395,8 +396,9 @@
}
else
{
- A.row(row++).segment(span - degree, degree + 1)
+ A.row(row).segment(span - degree, degree + 1)
= SplineType::BasisFunctions(parameters[i], degree, knots);
+ b.col(row++) = points.col(i);
}
}
b.col(0) = points.col(0);
Just stumbled over the same issue. There seems to be a bug in Eigen.
First example:
must be 0: 0
must be 0: 7.54792e+168
must be 0: 1.90459e+185
must be 0: 7.54792e+168
must be 0: 0
Second example:
must be 0: 0
must be 0: 0
must be 0: 0
must be 0: 0
must be 0: 0
The right hand side vector b does not get filled properly in InterpolateWithDerivatives (SplineFitting.h).
When calling lu.solve in your example, b is
0.0
0.0
1.0
1.90459157797e+185
2.06587336741e+161
0.0
0.0
I tested this, and it is fixed in the latest version of Eigen.

Create / manipulate diagonal matrices in Chapel

I have a square matrix A
use LinearAlgebra;
proc main() {
var A = Matrix(
[4.0, 0.8, 1.1, 0.0, 2.0]
,[0.8, 9.0, 1.3, 1.0, 0.0]
,[1.1, 1.3, 1.0, 0.5, 1.7]
,[0.0, 1.0, 0.5, 4.0, 1.5]
,[2.0, 0.0, 1.7, 1.5, 16.0]
);
}
And I want to construct the diagonal matrix D = 1/sqrt(a_ii). It seems like I have to extract the diagonal, then operate on each element. I expect this matrix to be very large and sparse, if that changes the answer.
Here's a solution using the LinearAlgebra module in 1.16 (pre-release):
use LinearAlgebra;
var A = Matrix(
[4.0, 0.8, 1.1, 0.0, 2.0],
[0.8, 9.0, 1.3, 1.0, 0.0],
[1.1, 1.3, 1.0, 0.5, 1.7],
[0.0, 1.0, 0.5, 4.0, 1.5],
[2.0, 0.0, 1.7, 1.5, 16.0]
);
var S = sqrt(1.0/diag(A));
// Type required because of promotion flattening
// See Linear Algebra documentation for more details..
var B: A.type = diag(S);
writeln(B);
Did you try this approach?
use Math;
var D: [A.domain] real;
forall i in D.dim( 1 ) {
D[i,i] = 1 / Math.sqrt( A[i,i] ); // ought to get division-by-zero protection
}
(At the moment the TiO-IDE does not have a fully functional LinearAlgebra package, so I cannot show you the results live, but I hope this points the way forward.)
Here's some code that works with a sparse diagonal array in version 1.15 today without linear algebra library support:
config const n = 10; // problem size; override with --n=1000 on command-line
const D = {1..n, 1..n}, // dense/conceptual matrix size
Diag: sparse subdomain(D) = genDiag(n); // sparse diagonal matrix
// iterator that yields indices describing the diagonal
iter genDiag(n) {
for i in 1..n do
yield (i,i);
}
// sparse diagonal matrix
var DiagMat: [Diag] real;
// assign sparse matrix elements in parallel
forall ((r,c), elem) in zip(Diag, DiagMat) do
elem = r + c/10.0;
// print sparse matrix elements serially
for (ind, elem) in zip(Diag, DiagMat) do
writeln("A[", ind, "] is ", elem);
// dense array
var Dense: [D] real;
// assign from sparse to dense
forall ij in D do
Dense[ij] = DiagMat[ij];
// print dense array
writeln(Dense);

SSE works on the array that the number of the elements is not the multiple of four

Hello everyone.
My question is this: suppose I have three arrays as follows
float a[7] = {1.0, 2.0, 3.0, 4.0,
5.0, 6.0, 7.0};
float b[7] = {2.0, 2.0, 2.0, 2.0,
2.0, 2.0, 2.0};
float c[7] = {0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0};
I want to perform an element-wise multiplication as follows:
c[i] = a[i] * b[i], i = 0, 1, ..., 6
For the first four elements, I can use SSE intrinsics as follows:
__m128* sse_a = (__m128*) &a[0];
__m128* sse_b = (__m128*) &b[0];
__m128* sse_c = (__m128*) &c[0];
*sse_c = _mm_mul_ps(*sse_a, *sse_b);
And the content in c will be
c[0] = 2.0, c[1] = 4.0, c[2] = 6.0, c[3] = 8.0
c[4] = 0.0, c[5] = 0.0, c[6] = 0.0
For the remaining three numbers at indices 4, 5, and 6, I use the following code to perform the element-wise multiplication:
sse_a = (__m128*) &a[4];
sse_b = (__m128*) &b[4];
sse_c = (__m128*) &c[4];
float mask[4] = {1.0, 1.0, 1.0, 0.0};
__m128* sse_mask = (__m128*) &mask[0];
*sse_c = _mm_add_ps( *sse_c,
_mm_mul_ps( _mm_mul_ps(*sse_a, *sse_b), *sse_mask ) );
And the content in c[4-6] will be
c[4] = 10.0, c[5] = 12.0, c[6] = 14.0, which is the expected result.
_mm_add_ps() adds four floating-point numbers in parallel. The first, second, and third of them correspond to indices 4, 5, and 6 of arrays a, b, and c.
But the fourth floating-point number lies outside the arrays.
To avoid an invalid memory access, I multiply by sse_mask to zero the fourth number before adding the result back to sse_c (array c).
But I'm wondering whether this is safe.
Many thanks.
You seem to have the mathematical operations right but I'm really not sure using casts like you do is the way to go to load and store data in __m128 vars.
Loading and storing
To load data from an array into a __m128 variable, you should use either __m128 _mm_load_ps (float const* mem_addr) or __m128 _mm_loadu_ps (float const* mem_addr). It is fairly easy to tell which is which, but a few clarifications:
For operations that access or manipulate memory, you usually have two functions doing the same thing, for example load and loadu. The first requires your memory to be aligned on a 16-byte boundary, while the u version does not have this requirement. If you don't know about memory alignment, use the u versions.
You also have load_ps and load_pd. The difference: the s stands for single precision (good old float), the d stands for double precision. Of course, a 128-bit register holds only two doubles (in a __m128d variable), but four floats.
So loading data from an array is pretty easy, just do: __m128 sse_a = _mm_loadu_ps(&a[0]);. Do the same for b, but for c it really depends. If you only want the result of the multiplication in it, it's useless to initialize it to 0, load it, add the result of the multiplication to it, and then finally store it back.
To store data, use the counterpart of the load operation, which is void _mm_storeu_ps (float* mem_addr, __m128 a). So once the multiplication is done and the result is in sse_c, just do _mm_storeu_ps(&c[0], sse_c);
Algorithm
The idea behind using the mask is good, but there is something easier: load and store the data starting from a[3] (same for b and c). That way the block has 4 elements (indices 3 to 6), so there is no need for any mask. Yes, the element at index 3 will already have been computed once, but that is completely transparent: the store operation just replaces the old value with the new one. Since both are equal, that's not a problem.
One alternative is to store 8 elements in your arrays even if you need only 7. That way you don't have to worry about whether the memory is allocated or not, and there is no need for special logic like the above, at the cost of 3 floats, which is nothing on any recent computer.
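The overlapping-tail idea can be sketched as follows. This is only a minimal sketch under stated assumptions: multiply_sse is a hypothetical helper name, the output array must not alias either input (the last block is recomputed from the inputs), and arrays shorter than four elements fall back to a plain scalar loop.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps

// Hypothetical helper: computes c[i] = a[i] * b[i] for i in [0, n).
// Assumption: c does not overlap a or b.
void multiply_sse(const float* a, const float* b, float* c, std::size_t n) {
    if (n < 4) {
        // Too short for a full 4-float block: use scalar code.
        for (std::size_t i = 0; i < n; ++i) c[i] = a[i] * b[i];
        return;
    }
    std::size_t i = 0;
    // Full blocks of four floats, using unaligned loads and stores.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_mul_ps(va, vb));
    }
    if (i < n) {
        // Leftover elements: redo the last four elements as one block.
        // It overlaps the previous block, but rewrites identical values.
        const std::size_t j = n - 4;
        const __m128 va = _mm_loadu_ps(a + j);
        const __m128 vb = _mm_loadu_ps(b + j);
        _mm_storeu_ps(c + j, _mm_mul_ps(va, vb));
    }
}
```

Called as multiply_sse(a, b, c, 7) on the arrays from the question, the first block covers indices 0 to 3 and the overlapping block covers indices 3 to 6, so no load or store touches memory past the end of the arrays.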

Vector C++ size and access elements

I have this in a code
vector<vector<double> > times(pCount, vector<double>(5,0.0));
My question is: what is the size of the matrix it allocates? And if I need to access all the values in it, what can I do?
You have a pCount × 5 matrix. The first index can be between 0 and pCount - 1 (inclusive), the second index can be between 0 and 4 (inclusive). All the values are initialized to 0.
This is because you're using the std::vector constructor whose first argument is a count n (the number of elements to initialize the vector with), and whose second argument is a value which is copied n times. So, times is a vector with pCount elements, each of which is a vector<double>. Each of those vectors is a copy of the provided vector<double>(5,0.0), which is constructed with 5 elements, each of which is 0.0.
You can get any individual value like times[3][2], or what-have-you. With C++11 or later you can iterate through all the values like this:
for (auto& v : times)
for (double& d : v)
d += 3.14; // or whatever
If you don't need to modify the values, but only access them, you can remove the ampersands, or better yet do:
for (const auto& v : times)
for (double d : v)
std::cout << d << ", "; // or whatever
Before C++11, you have to be much more wordy, or just use indices:
for (int i = 0; i < pCount; ++i)
for (int j = 0; j < 5; ++j)
times[i][j] += 3.14; // or whatever
This is equivalent in size to a built-in array with dimensions [pCount][5] (shown here for a constant pCount = 5):
double times[pCount][5] = {{0.0, 0.0, 0.0, 0.0, 0.0}, // |
{0.0, 0.0, 0.0, 0.0, 0.0}, // |
{0.0, 0.0, 0.0, 0.0, 0.0}, // | pCount = 5
{0.0, 0.0, 0.0, 0.0, 0.0}, // |
{0.0, 0.0, 0.0, 0.0, 0.0}}; // |
Of course, you're using vectors, so the number of rows and columns can change after times is created.
std::vector provides an overload of operator[], so you can access data using that operator.
auto Val = times[2][3];
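To check the dimensions at run time instead of reading them off the constructor call, the sizes of the outer and inner vectors can be queried directly. The following is a minimal sketch; the Matrix alias and the helper names row_count, column_count, and sum_all are hypothetical, and column_count assumes that every row still has the same length.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Number of rows: the size of the outer vector.
std::size_t row_count(const Matrix& m) { return m.size(); }

// Number of columns, taken from the first row.
// Assumption: all rows have equal length (true right after construction,
// but not guaranteed once individual rows are resized).
std::size_t column_count(const Matrix& m) {
    return m.empty() ? 0 : m.front().size();
}

// Visits every value exactly once and returns the sum.
double sum_all(const Matrix& m) {
    double total = 0.0;
    for (const auto& row : m)
        for (double value : row)
            total += value;
    return total;
}
```

For times constructed as in the question, row_count(times) equals pCount and column_count(times) equals 5.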

arbitrary datatype ratio converter

I have the following code:
template<typename I,typename O> O convertRatio(I input,
I inpMinLevel = std::numeric_limits<I>::min(),
I inpMaxLevel = std::numeric_limits<I>::max(),
O outMinLevel = std::numeric_limits<O>::min(),
O outMaxLevel = std::numeric_limits<O>::max() )
{
double inpRange = abs(double(inpMaxLevel - inpMinLevel));
double outRange = abs(double(outMaxLevel - outMinLevel));
double level = double(input)/inpRange;
return O(outRange*level);
}
The usage is something like this:
int value = convertRatio<float,int>(0.5f, -1.0f, 1.0f);
//value is around 1073741823 (a quarter of the range of signed int)
The problem occurs for I=int and O=float with the default function parameters:
float value = convertRatio<int,float>(123456);
The result of double(inpMaxLevel - inpMinLevel) is -1.0, and I expect it to be 4294967295 as a floating-point value.
Do you have any idea how to do this better?
The basic idea is just to convert a value from one range to another, with the possibility of different data types.
Adding to romkyns's answer: besides casting all values to double before subtracting to prevent overflow, your code returns wrong results when the lower bounds are different from 0, because you don't adjust the values appropriately. The idea is to map the range [in_min, in_max] to the range [out_min, out_max], so:
f(in_min) = out_min
f(in_max) = out_max
Let x be the value to map. The algorithm is something like:
Map the range [in_min, in_max] to [0, in_max - in_min]. To do this, subtract in_min from x.
Map the range [0, in_max - in_min] to [0, 1]. To do this, divide x by (in_max - in_min).
Map the range [0, 1] to [0, out_max - out_min]. To do this, multiply x by (out_max - out_min).
Map the range [0, out_max - out_min] to [out_min, out_max]. To do this, add out_min to x.
The following implementation in C++ does this (I will omit the default values to make the code clearer):
template <class I, class O>
O convertRatio(I x, I in_min, I in_max, O out_min, O out_max) {
const double t = ((double)x - (double)in_min) /
((double)in_max - (double)in_min);
const double res = t * ((double)out_max - (double)out_min) + out_min;
return O(res);
}
Notice that I didn't take the absolute value of the range sizes. This allows reverse mapping. For example, it makes it possible to map [-1.0, 1.0] to [3.0, 2.0], giving the following results:
convertRatio(-1.0, -1.0, 1.0, 3.0, 2.0) = 3.0
convertRatio(-0.8, -1.0, 1.0, 3.0, 2.0) = 2.9
convertRatio(0.8, -1.0, 1.0, 3.0, 2.0) = 2.1
convertRatio(1.0, -1.0, 1.0, 3.0, 2.0) = 2.0
The only condition needed is that in_min != in_max (to prevent division by zero) and out_min != out_max (otherwise, all inputs will be mapped to the same point). To prevent rounding errors, try to not use small ranges.
Try
(double) inpMaxLevel - (double) inpMinLevel
instead. What you are doing currently is subtracting min from max while the numbers are still of type int, which necessarily overflows; a signed int is fundamentally incapable of representing the difference between its min and max.
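To illustrate the point about converting before subtracting, here is a minimal sketch. The helper name full_range_width is hypothetical, and it uses std::numeric_limits<T>::lowest() rather than min(), which is a choice made for this sketch: for integer types the two are identical, but for floating-point types min() is the smallest positive normalized value, not the most negative one.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: width of the full value range of an arithmetic type T.
// Both bounds are converted to double first, so the subtraction happens in
// double and cannot overflow T.
template <typename T>
double full_range_width() {
    return static_cast<double>(std::numeric_limits<T>::max()) -
           static_cast<double>(std::numeric_limits<T>::lowest());
}
```

For a 32-bit signed integer, full_range_width<std::int32_t>() gives 4294967295.0, the value the question expected, whereas performing the subtraction in int first overflows.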