DSYEV and DSYEVD for sparse matrix diagonalisation - fortran

So I have come to a point where DSYEVD is becoming impractical due to its higher memory requirements - it will need at least 240GB of RAM to diagonalise my matrix - so I'm considering moving to the DSYEV routine, which requires less memory:
DSYEVD requires (1 + 6N + 3N^2) words of workspace, while DSYEV requires just 3N + N^2.
The smaller matrix I have has dimension ~100,000 and the larger has dimension ~230,000. Obviously the second will have to be done using DSYEV, but the first I can do using either.
I had a quick look on Google but couldn't find any relevant benchmarks - the only ones available have N = 10-1000, and are for dense matrices. My matrices are of dimension 100K and 90% sparse, and of dimension 230K and 95% sparse.
Does anyone know how long I can expect this to take?
I have already taken the first 3000 eigenvectors and eigenvalues from each using DSYEVD, but now I need to take all of them. When taking only 3K vectors, using an iterative scheme, it took ~1.6 days for the 100K matrix and ~3 days for the 230K matrix. The jobs were run on an Intel(R) Xeon(R) CPU E5-4640 0 @ 2.40GHz with 256GB of RAM.
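For reference, a quick back-of-the-envelope check of the workspace formulas quoted above (a sketch only; it assumes a "word" means one 8-byte double-precision number):

#include <cstdio>

int main() {
    // Rough sizes in GB, assuming a "word" is one 8-byte double.
    for (double n : {1.0e5, 2.3e5}) {
        double matrix = n * n * 8.0 / 1.0e9;                          // the dense matrix itself
        double dsyevd = (1.0 + 6.0 * n + 3.0 * n * n) * 8.0 / 1.0e9;  // DSYEVD workspace
        double dsyev  = (3.0 * n + n * n) * 8.0 / 1.0e9;              // DSYEV workspace
        std::printf("N = %.0f: matrix %.0f GB, DSYEVD work %.0f GB, DSYEV work %.0f GB\n",
                    n, matrix, dsyevd, dsyev);
    }
}

For N = 100,000 the DSYEVD workspace alone comes out at 240GB, which matches the figure above.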

Related

Efficient stable sum of a sorted array in AVX2

Consider a sorted (ascending) array of double numbers. For numerical stability the array should be summed up as if iterating it from the beginning till the end, accumulating the sum in some variable.
How to vectorize this efficiently with AVX2?
I've looked into this method Fastest way to do horizontal vector sum with AVX instructions, but it seems quite tricky to scale it to an array (some divide&conquer approach may be needed), while keeping the floating-point precision by ensuring that small numbers are summed up before adding them to a larger number.
Clarification 1: I think it should be ok to e.g. sum the first 4 items, then add them to the sum of the next 4 items, etc. I'm willing to trade some stability for performance. But I would prefer a method that doesn't ruin the stability completely.
Clarification 2: memory shouldn't be a bottleneck because the array is in L3 cache (but not in L1/L2 cache, because pieces of the array were populated from different threads). I wouldn't like to resort to Kahan summation because I think it's really the number of operations that matters, and Kahan summation would increase it about 4 times.
If you need precision and parallelism, use Kahan summation or another error-compensation technique to let you reorder your sum (into SIMD vector element strides with multiple accumulators).
As Twofold fast summation - Evgeny Latkin points out, if you bottleneck on memory bandwidth, an error-compensated sum isn't much slower than a max-performance sum, since the CPU has lots of computation throughput that goes unused in a simply-parallelized sum that bottlenecks on memory bandwidth.
See also (google results for kahan summation avx)
https://github.com/rreusser/summation-algorithms
https://scicomp.stackexchange.com/questions/10869/which-algorithm-is-more-accurate-for-computing-the-sum-of-a-sorted-array-of-numb
Is this way of processing the tail of an array with SSE overkill? has a sample implementation of SSE Kahan, not unrolled, and a comparison of actual error with it (no error) vs. sequential sum (bad) vs. simple SIMD sum (much less total error), showing that just vectorizing (and/or unrolling) with multiple accumulators tends to help accuracy.
Re: your idea: Summing groups of 4 numbers in-order will let you hide the FP-add latency, and bottleneck on scalar add throughput.
Doing horizontal sums within vectors takes a lot of shuffling, so it's a potential bottleneck. You might consider loading a0 a1 a2 a3, then shuffling to get a0+a1 x a2+a3 x, then (a0+a1) + (a2+a3). You have a Ryzen, right? The last step is just a vextractf128 down to 128b. That's still 3 total ADD uops, and 3 shuffle uops, but with fewer instructions than scalar or 128b vectors.
Your idea is very similar to Pairwise Summation
You're always going to get some rounding error, but adding numbers of similar magnitude minimizes it.
See also Simd matmul program gives different numerical results for some comments about Pairwise Summation and simple efficient SIMD.
The difference between adding 4 contiguous numbers vs. vertically adding 4 SIMD vectors is probably negligible. SIMD vectors give you small strides (of SIMD vector width) in the array. Unless the array grows extremely quickly, they're still going to have basically similar magnitudes.
You don't need to horizontal sum until the very end to still get most of the benefit. You can maintain 1 or 2 SIMD vector accumulators while you use more SIMD registers to sum short runs (of maybe 4 or 8 SIMD vectors) before adding into the main accumulators.
In fact having your total split more ways (across the SIMD vector elements) means it doesn't grow as large. So it helps with exactly the problem you're trying to avoid, and horizontal summing down to a single scalar accumulator actually makes things worse, especially for a strictly sorted array.
With out-of-order execution, you don't need very many tmp accumulators to make this work and hide the FP-add latency of accumulating into the main accumulators. You can do a couple groups of accumulating into a fresh tmp = _mm_load_ps() vector and adding that to the total, and OoO exec will overlap those executions. So you don't need a huge unroll factor for your main loop.
But it shouldn't be too small, you don't want to bottleneck on the add latency into the main accumulator, waiting for the previous add to produce a result before the next one can start. You want to bottleneck on FP-add throughput. (Or if you care about Broadwell/Haswell and you don't totally bottleneck on memory bandwidth, mix in some FMA with a 1.0 multiplier to take advantage of that throughput.)
e.g. Skylake SIMD FP add has 4 cycle latency, 0.5 cycle throughput, so you need to be doing at least 7 adds that are part of a short dep chain for every add into a single accumulator. Preferably more, and/or preferably with 2 long-term accumulators to better absorb bubbles in scheduling from resource conflicts.
See _mm256_fmadd_ps is slower than _mm256_mul_ps + _mm256_add_ps? for more about multiple accumulators.
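As a concrete illustration of the multiple-accumulator idea above, here is a minimal sketch (an illustration, not code from the linked answers; it assumes the array length is a multiple of 32 doubles and 32-byte alignment, and a real version would need a cleanup loop and a tuned unroll factor):

#include <immintrin.h>
#include <cstddef>

double sum_sorted(const double* a, std::size_t n) {
    // Two long-term accumulators; each gets a short in-order run of 4 vectors
    // per iteration, so the adds into acc0/acc1 are off the critical path.
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    for (std::size_t i = 0; i < n; i += 32) {
        __m256d run0 = _mm256_load_pd(a + i);
        run0 = _mm256_add_pd(run0, _mm256_load_pd(a + i + 4));
        run0 = _mm256_add_pd(run0, _mm256_load_pd(a + i + 8));
        run0 = _mm256_add_pd(run0, _mm256_load_pd(a + i + 12));
        __m256d run1 = _mm256_load_pd(a + i + 16);
        run1 = _mm256_add_pd(run1, _mm256_load_pd(a + i + 20));
        run1 = _mm256_add_pd(run1, _mm256_load_pd(a + i + 24));
        run1 = _mm256_add_pd(run1, _mm256_load_pd(a + i + 28));
        acc0 = _mm256_add_pd(acc0, run0);
        acc1 = _mm256_add_pd(acc1, run1);
    }
    // Horizontal sum only once, at the very end.
    __m256d acc = _mm256_add_pd(acc0, acc1);
    __m128d lo = _mm_add_pd(_mm256_castpd256_pd128(acc), _mm256_extractf128_pd(acc, 1));
    return _mm_cvtsd_f64(_mm_add_sd(lo, _mm_unpackhi_pd(lo, lo)));
}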
Here's my solution so far:
#include <immintrin.h>
#include <cstddef>

double SumVects(const __m256d* pv, std::size_t n) {
    if (n == 0) return 0.0;
    __m256d sum = pv[0];
    if (n == 1) {
        sum = _mm256_permute4x64_pd(sum, _MM_SHUFFLE(3, 1, 2, 0));
    } else {
        for (std::size_t i = 1; i + 1 < n; i++) {
            // Pairwise-add the running sum with the next vector, then
            // re-interleave so adjacent partial sums stay together.
            sum = _mm256_hadd_pd(sum, pv[i]);
            sum = _mm256_permute4x64_pd(sum, _MM_SHUFFLE(3, 1, 2, 0));
        }
        sum = _mm256_hadd_pd(sum, pv[n - 1]);
    }
    // Reduce the 4 lanes to a scalar (m128d_f64 is MSVC-specific member access).
    const __m128d laneSums = _mm_hadd_pd(_mm256_extractf128_pd(sum, 1),
                                         _mm256_castpd256_pd128(sum));
    return laneSums.m128d_f64[0] + laneSums.m128d_f64[1];
}
Some explanation: it adds adjacent double array items first, such as a[0]+a[1], a[2]+a[3], etc. Then it adds the sums of adjacent items.
The games you want to play are likely counterproductive. Try experimenting by generating a bunch of iid samples from your favourite distribution, sorting them, and comparing "sum in increasing order" with "sum each lane in increasing order, then sum the lane sums."
For 4 lanes and 16 data, summing lanewise gives me smaller error about 28% of the time while summing in increasing order gives me smaller error about 17% of the time; for 4 lanes and 256 data, summing lanewise gives me smaller error about 68% of the time while summing in increasing order gives me smaller error about 12% of the time. Summing lanewise also beats the algorithm you gave in your self-answer. I used a uniform distribution on [0,1] for this.
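A rough sketch of such an experiment (an illustration only; it assumes uniform [0,1] samples, 4 strided lanes as a vertical SIMD sum would produce, and a long double reference for the error):

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937_64 rng(42);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    const int lanes = 4, n = 256, trials = 10000;
    int lanewiseWins = 0, inorderWins = 0;
    std::vector<double> a(n);
    for (int t = 0; t < trials; ++t) {
        for (double& x : a) x = dist(rng);
        std::sort(a.begin(), a.end());

        long double ref = 0.0L;                 // reference in extended precision
        for (double x : a) ref += x;

        double inorder = 0.0;                   // plain sum in increasing order
        for (double x : a) inorder += x;

        double lane[lanes] = {0, 0, 0, 0};      // strided lanes, summed vertically
        for (int i = 0; i < n; ++i) lane[i % lanes] += a[i];
        double lanewise = (lane[0] + lane[1]) + (lane[2] + lane[3]);

        long double errIn   = std::fabs((long double)inorder - ref);
        long double errLane = std::fabs((long double)lanewise - ref);
        if (errLane < errIn) ++lanewiseWins;
        else if (errIn < errLane) ++inorderWins;
    }
    std::printf("lanewise more accurate: %.1f%%, in-order more accurate: %.1f%%\n",
                100.0 * lanewiseWins / trials, 100.0 * inorderWins / trials);
}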

How to optimize this product of three matrices in C++ for x86?

I have a key algorithm in which most of its runtime is spent on calculating a dense matrix product:
A*A'*Y, where:
- A is an m-by-n matrix,
- A' is its conjugate transpose,
- Y is an m-by-k matrix
Typical characteristics:
- k is much smaller than both m and n (k is typically < 10)
- m in the range [500, 2000]
- n in the range [100, 1000]
Based on these dimensions, according to the lessons of the matrix chain multiplication problem, it's clear that it's optimal in a number-of-operations sense to structure the computation as A*(A'*Y). My current implementation does that, and the performance boost from just forcing that associativity to the expression is noticeable.
My application is written in C++ for the x86_64 platform. I'm using the Eigen linear algebra library, with Intel's Math Kernel Library as a backend. Eigen is able to use IMKL's BLAS interface to perform the multiplication, and the boost from moving from Eigen's native SSE2 implementation to Intel's optimized, AVX-based implementation on my Sandy Bridge machine is also significant.
However, the expression A * (A.adjoint() * Y) (written in Eigen parlance) gets decomposed into two general matrix-matrix products (calls to the xGEMM BLAS routine), with a temporary matrix created in between. I'm wondering if, by going to a specialized implementation for evaluating the entire expression at once, I can arrive at an implementation that is faster than the generic one that I have now. A couple observations that lead me to believe this are:
Using the typical dimensions described above, the input matrix A usually won't fit in cache. Therefore, the specific memory access pattern used to calculate the three-matrix product would be key. Obviously, avoiding the creation of a temporary matrix for the partial product would also be advantageous.
A and its conjugate transpose obviously have a very related structure that could possibly be exploited to improve the memory access pattern for the overall expression.
Are there any standard techniques for implementing this sort of expression in a cache-friendly way? Most optimization techniques that I've found for matrix multiplication are for the standard A*B case, not larger expressions. I'm comfortable with the micro-optimization aspects of the problem, such as translating into the appropriate SIMD instruction sets, but I'm looking for any references out there for breaking this structure down in the most memory-friendly manner possible.
Edit: Based on the responses that have come in thus far, I think I was a bit unclear above. The fact that I'm using C++/Eigen is really just an implementation detail from my perspective on this problem. Eigen does a great job of implementing expression templates, but evaluating this type of problem as a simple expression just isn't supported (only products of 2 general dense matrices are).
At a higher level than how the expressions would be evaluated by a compiler, I'm looking for a more efficient mathematical breakdown of the composite multiplication operation, with a bent toward avoiding unneeded redundant memory accesses due to the common structure of A and its conjugate transpose. The result would likely be difficult to implement efficiently in pure Eigen, so I would likely just implement it in a specialized routine with SIMD intrinsics.
This is not a full answer (yet - and I'm not sure it will become one).
Let's think of the math first a little. Since matrix multiplication is associative we can either do
(A*A')*Y or A*(A'*Y).
Floating point operations for (A*A')*Y
2*m*n*m + 2*m*m*k //the twos come from addition and multiplication
Floating point operations for A*(A'*Y)
2*m*n*k + 2*m*n*k = 4*m*n*k
Since k is much smaller than m and n it's clear why the second case is much faster.
But by symmetry we could in principle reduce the number of calculations for A*A' by a factor of two (though this might not be easy to do with SIMD), so we could reduce the number of floating point operations of (A*A')*Y to
m*n*m + 2*m*m*k.
We know that both m and n are larger than k. Let's choose a new variable for m and n called z and find out where case one and two are equal:
z*z*z + 2*z*z*k = 4*z*z*k //now simplify
z = 2*k.
So as long as m and n are both more than twice k the second case will have less floating point operations. In your case m and n are both more than 100 and k less than 10 so case two uses far fewer floating point operations.
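To make the gap concrete, a quick tally using sizes from the middle of the ranges in the question (m = 1000, n = 500, k = 10, chosen only as examples):

#include <cstdio>

int main() {
    // FLOP counts from the formulas above.
    double m = 1000, n = 500, k = 10;
    double case1    = 2*m*n*m + 2*m*m*k;   // (A*A')*Y
    double case1sym = m*n*m + 2*m*m*k;     // (A*A')*Y exploiting symmetry of A*A'
    double case2    = 4*m*n*k;             // A*(A'*Y)
    std::printf("(A*A')*Y: %.2e flops (%.2e with symmetry), A*(A'*Y): %.2e flops\n",
                case1, case1sym, case2);
}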
In terms of efficient code: if the code is optimized for efficient use of the cache (as MKL and Eigen are), then large dense matrix multiplication is computation bound and not memory bound, so you don't have to worry about the cache. MKL is faster than Eigen since MKL uses AVX (and maybe FMA3 now?).
I don't think you will be able to do this more efficiently than you're already doing using the second case and MKL (through Eigen). Enable OpenMP to get maximum FLOPS.
You should calculate the efficiency by comparing your FLOPS to the peak FLOPS of your processor. Assuming you have a Sandy Bridge/Ivy Bridge processor, the peak SP FLOPS is
frequency * number of physical cores * 8 (8-wide AVX SP) * 2 (addition + multiplication)
For double precision, divide by two. If you have Haswell and MKL uses FMA then double the peak FLOPS. To get the frequency right you have to use the turbo boost values for all cores (it's lower than for a single core). You can look this up if you have not overclocked your system, or use CPU-Z on Windows or Powertop on Linux if you have an overclocked system.
Use a temporary matrix to compute A'*Y, but make sure you tell Eigen that there's no aliasing going on: temp.noalias() = A.adjoint()*Y. Then compute your result, once again telling Eigen that objects aren't aliased: result.noalias() = A*temp.
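A minimal self-contained sketch of that two-step evaluation (complex doubles are assumed here so that adjoint() really is a conjugate transpose; the function name is illustrative):

#include <Eigen/Dense>

// Two xGEMM-sized products with an explicit n-by-k temporary in between.
Eigen::MatrixXcd productAAY(const Eigen::MatrixXcd& A, const Eigen::MatrixXcd& Y) {
    Eigen::MatrixXcd temp(A.cols(), Y.cols());
    temp.noalias() = A.adjoint() * Y;    // n-by-k temporary
    Eigen::MatrixXcd result(A.rows(), Y.cols());
    result.noalias() = A * temp;         // m-by-k result
    return result;
}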
There would be redundant computation only if you performed (A*A')*Y, since in this case (A*A') is symmetric and only half of the computation is required. However, as you noticed, it is still much faster to perform A*(A'*Y), in which case there are no redundant computations. I confirm that the cost of the temporary creation is completely negligible here.
I guess that performing the following
result = A * (A.adjoint() * Y)
will be the same as doing this:
temp = A.adjoint() * Y
result = A * temp;
If your matrix Y fits in the cache, you can probably take advantage of doing it like this:
result = A * (Y.adjoint() * A).adjoint()
or, if the previous notation is not allowed, like this:
temp = Y.adjoint() * A
result = A * temp.adjoint();
Then you don't need to take the adjoint of matrix A, or store a temporary adjoint matrix for A, which would be much more expensive than the one for Y.
If your matrix Y fits in the cache, it should be much faster to run a loop over the columns of A for the first multiplication and then over the rows of A for the second multiplication (having Y.adjoint() in the cache for the first multiplication and temp.adjoint() for the second), but I guess that internally Eigen is already taking care of these things.
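For completeness, a sketch of that variant (same assumptions as the Eigen sketch above; the function name is illustrative):

#include <Eigen/Dense>

// Compute Y'*A first, so A itself is never adjointed into a temporary;
// only the small k-by-n temporary gets adjointed.
Eigen::MatrixXcd productViaYAdjoint(const Eigen::MatrixXcd& A, const Eigen::MatrixXcd& Y) {
    Eigen::MatrixXcd temp(Y.cols(), A.cols());
    temp.noalias() = Y.adjoint() * A;    // k-by-n temporary
    return A * temp.adjoint();           // m-by-k result
}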

Amortized analysis of std::vector insertion

How do we do the analysis of insertion at the back (push_back) in a std::vector? Its amortized time is O(1) per insertion. In particular, in a video on Channel9 by Stephan T. Lavavej and in this (17:42 onwards) he says that for optimal performance Microsoft's implementation of this method increases the capacity of the vector by around 1.5.
How is this constant determined?
Assuming you mean push_back and not insertion, I believe that the important part is multiplying by some constant (as opposed to grabbing N more elements each time), and as long as you do this you'll get amortized constant time. Changing the factor changes the average case and worst case performance.
Concretely:
If your constant factor is too large, you'll have good average case performance, but bad worst case performance, especially as the arrays get big. For instance, imagine doubling (2x) a 10000-element vector just because you pushed the 10001st element. EDIT: As Michael Burr indirectly pointed out, the real cost here is probably that you'll grow your memory much larger than you need it to be. I would add to this that there are cache issues that affect speed if your factor is too large. Suffice it to say that there are real costs (memory and computation) if you grow much larger than you need.
However if your constant factor is too small, say (1.1x) then you're going to have good worst case performance, but bad average performance, because you're going to have to incur the cost of reallocating too many times.
Also, see Jon Skeet's answer to a similar question previously. (Thanks @Bo Persson)
A little more about the analysis: Say you have n items you are pushing back and a multiplication factor of M. Then the number of reallocations will be roughly log base M of n (log_M(n)), and the ith reallocation will have a cost proportional to M^i (M to the ith power). Then the total time of all the pushbacks will be M^1 + M^2 + ... + M^(log_M(n)). The number of pushbacks is n, and thus you get this series (which is a geometric series, and reduces to roughly (nM)/(M-1) in the limit) divided by n. This is roughly a constant, M/(M-1).
For large values of M you will overshoot a lot and allocate much more than you need reasonably often (which I mentioned above). For small values of M (close to 1) this constant M/(M-1) becomes large. This factor directly affects the average time.
You can do the math to try to figure how this kind of thing works.
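A small simulation of that trade-off (a sketch only; n and the growth factors are arbitrary example values) that counts element copies and the final-capacity overshoot:

#include <cmath>
#include <cstdio>

int main() {
    // Count total element copies and final capacity for n push_backs under a
    // few growth factors M.
    const long long n = 10000000;
    for (double M : {1.1, 1.5, 2.0, 4.0}) {
        long long capacity = 1, size = 0, copies = 0;
        for (long long i = 0; i < n; ++i) {
            if (size == capacity) {
                copies += size;  // a reallocation copies every existing element
                capacity = (long long)std::ceil(capacity * M);
            }
            ++size;
        }
        std::printf("M = %.1f: %lld copies (%.2f per push_back), final capacity %lld\n",
                    M, copies, (double)copies / n, capacity);
    }
}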
A popular method for this kind of amortized analysis is the banker's method. What you do is mark up all your operations with an extra cost, "saving" it for later to pay for an expensive operation later on.
Let's make some dumb assumptions to simplify the math:
Writing into an array costs 1. (Same for inserting and moving between arrays)
Allocating a larger array is free.
And our algorithm looks like:
function insert(x):
    if n_elements >= maximum array size:
        move all elements to a new array that
        is K times larger than the current size
    add x to array
    n_elements += 1
Obviously, the "worst case" happens when we have to move the elements to the new array. Let's try to amortize this by adding a constant markup of d to the insertion cost, bringing it to a total of (1 + d) per operation.
Just after an array has been resized, we have (1/K) of it filled up and no money saved.
By the time we fill the array up, we can be sure to have at least d * (1 - 1/K) * N saved up. Since this money must be able to pay for all N elements being moved, we can figure out a relation between K and d:
d*(1 - 1/K)*N = N
d*(K-1)/K = 1
d = K/(K-1)
A helpful table:
K      d      1+d (total insertion cost)
1.0    inf    inf
1.1    11.0   12.0
1.5    3.0    4.0
2.0    2.0    3.0
3.0    1.5    2.5
4.0    1.3    2.3
inf    1.0    2.0
So from this you can get a rough mathematician's idea of how the time/memory tradeoff works for this problem. There are some caveats, of course: I didn't go over shrinking the array when it gets fewer elements, this only covers the worst case where no elements are ever removed, and the time costs of allocating extra memory weren't accounted for.
They most likely ran a bunch of experimental tests to figure this out in the end, making most of what I wrote irrelevant though.
Uhm, the analysis is really simple when you're familiar with number systems, such as our usual decimal one.
For simplicity, then, assume that each time the current capacity is reached, a new 10x as large buffer is allocated.
If the original buffer has size 1, then the first reallocation copies 1 element, the second (where now the buffer has size 10) copies 10 elements, and so on. So with five reallocations, say, you have 1+10+100+1000+10000 = 11111 element copies performed. Multiply that by 9, and you get 99999; now add 1 and you have 100000 = 10^5. Or in other words, doing that backwards, the number of element copies performed to support those 5 reallocations has been (10^5-1)/9.
And the buffer size after 5 reallocations, 5 multiplications by 10, is 10^5. Which is roughly a factor of 9 larger than the number of element copy operations. Which means that the time spent on copying is roughly linear in the resulting buffer size.
With base 2 instead of 10 you get (2^5-1)/1 = 2^5-1.
And so on for other bases (or factors to increase buffer size by).
Cheers & hth.

Speed up float 5x5 matrix * vector multiplication with SSE

I need to run a matrix-vector multiplication 240000 times per second. The matrix is 5x5 and is always the same, whereas the vector changes at each iteration. The data type is float. I was thinking of using some SSE (or similar) instructions.
I am concerned that the number of arithmetic operations is too small compared to the number of memory operations involved. Do you think I can get some tangible (e.g. > 20%) improvement?
Do I need the Intel compiler to do it?
Can you point out some references?
The Eigen C++ template library for vectors, matrices, ... has both
- optimised code for small fixed-size matrices (as well as dynamically sized ones)
- optimised code that uses SSE optimisations
so you should give it a try.
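For illustration, giving it a try could look something like this (a sketch; the function and parameter names are not from Eigen's documentation):

#include <Eigen/Dense>

// Fixed-size 5x5 float matrix applied to a stream of 5-vectors; Eigen selects
// vectorised kernels for fixed sizes where it can.
void applyMatrix(const Eigen::Matrix<float, 5, 5>& M,
                 const Eigen::Matrix<float, 5, 1>* in,
                 Eigen::Matrix<float, 5, 1>* out, int count) {
    for (int i = 0; i < count; ++i)
        out[i].noalias() = M * in[i];
}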
In principle the speedup could be 4 times with SSE (8 times with AVX). Let me explain.
Let's call your fixed 5x5 matrix M. Define the components of a 5D vector as (x,y,z,w,t). Now form a 5x4 matrix U from the first four vectors.
U =
xxxx
yyyy
zzzz
wwww
tttt
Next, do the matrix product MU = V. The matrix V contains the product of M and the first four vectors. The only problem is that for SSE we need to read in the rows of U, but in memory U is stored as xyzwtxyzwtxyzwtxyzwt, so we have to transpose it to xxxxyyyyzzzzwwwwtttt. This can be done with shuffles/blends in SSE. Once we have this format the matrix product is very efficient.
Instead of taking O(5x5x4) operations with scalar code, it only takes O(5x5) operations, i.e. a 4x speedup. With AVX the matrix U will be 5x8, so instead of taking O(5x5x8) operations it only takes O(5x5), i.e. an 8x speedup.
The matrix V, however, will be in xxxxyyyyzzzzwwwwtttt format so depending on the application it might have to be transposed to xyzwtxyzwtxyzwtxyzwt format.
Repeat this for the next four vectors (8 for AVX) and so forth until done.
If you have control over the vectors, for example if your application generates the vectors on the fly, then you can generate them in xxxxyyyyzzzzwwwwtttt format and avoid the transpose of the array. In that case you should get a 4x speedup with SSE and an 8x speedup with AVX. If you combine this with threading, e.g. OpenMP, your speedup should be close to 16x (assuming four physical cores) with SSE. I think that's the best you can do with SSE.
Edit: Due to instruction level parallelism (ILP) you can get another factor of 2 in speedup, so the speedup for SSE could be 32x with four cores (64x with AVX), and again another factor of 2 with Haswell due to FMA3.
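A minimal sketch of the SoA scheme described above, for the case where the transpose has already been done (or the vectors were generated in that layout to begin with); names are illustrative, not from the answer:

#include <xmmintrin.h>

// in[j][v] holds component j of vector v (the "xxxxyyyyzzzzwwwwtttt" layout),
// and the 4 results come out in the same layout. M is the fixed 5x5 matrix in
// row-major order.
void mat5xVec4(const float M[5][5], const float in[5][4], float out[5][4]) {
    __m128 col[5];
    for (int j = 0; j < 5; ++j)
        col[j] = _mm_loadu_ps(in[j]);            // component j of all 4 vectors
    for (int i = 0; i < 5; ++i) {
        __m128 acc = _mm_mul_ps(_mm_set1_ps(M[i][0]), col[0]);
        for (int j = 1; j < 5; ++j)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(M[i][j]), col[j]));
        _mm_storeu_ps(out[i], acc);              // component i of all 4 results
    }
}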
I would suggest using Intel IPP and abstracting yourself away from dependence on specific techniques.
If you're using GCC, note that the -O3 option will enable auto-vectorization, which will automatically generate SSE or AVX instructions in many cases. In general, if you just write it as a simple for-loop, GCC will vectorize it. See http://gcc.gnu.org/projects/tree-ssa/vectorization.html for more information.
This should be easy, especially when you're on Core 2 or later: you need 5x _mm_dp_ps, one _mm_mul_ps, two _mm_add_ps, one ordinary multiplication, plus some shuffles, loads and stores (and if the matrix is fixed, you can keep most of it in SSE registers, if you don't need them for anything else).
As for memory bandwidth: we're talking about 2.4 megabytes of vectors, when memory bandwidths are in single-digit gigabytes per second.
What is known about the vector? Since the matrix is fixed, AND if there is a limited number of values that the vector can take, then I'd suggest that you pre-compute the calculations and access them using a table look-up.
The classic optimization technique to trade memory for cycles...
I would recommend having a look at an optimised BLAS library, such as the Intel MKL or the AMD ACML. Based on your description I would assume that you'd be after the SGEMV level 2 matrix-vector routine, to do y = A*x style operations.
If you really want to implement something yourself, using the (available) SSE..SSE4 and AVX instruction sets can offer significant performance improvements in some cases, although this is exactly what a good BLAS library will be doing. You also need to think a lot about cache-friendly data access patterns.
I don't know if this is applicable in your case, but can you operate on "chunks" of vectors at a time? So rather than repeatedly doing a y = A*x style operation, can you operate on blocks of [y1 y2 ... yn] = A * [x1 x2 ... xn]? If so, this means that you could use an optimised matrix-matrix routine, such as SGEMM. Due to the data access patterns this may be significantly more efficient than repeated calls to SGEMV. If it were me, I would try to go down this path...
Hope this helps.
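A sketch of that batched approach through a CBLAS interface (the function and parameter names here are illustrative, not from any particular library's examples):

#include <cblas.h>

// Treat a block of nvec input vectors as the columns of a 5 x nvec matrix X
// and compute Y = A*X with one SGEMM call instead of nvec SGEMV calls.
// Column-major storage is assumed; any CBLAS provider (e.g. MKL) should work.
void applyBlock(const float* A /* 5x5 */, const float* X /* 5 x nvec */,
                float* Y /* 5 x nvec */, int nvec) {
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                5, nvec, 5,     // M, N, K
                1.0f, A, 5,     // alpha, A, lda
                X, 5,           // B, ldb
                0.0f, Y, 5);    // beta, C, ldc
}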
If you know the vectors in advance (e.g., doing all 240k at once), you'd get a better speedup by parallelising the loop than by going to SSE. If you've already taken that step, or you don't know them all at once, SSE could be a big benefit.
If the memory is contiguous, then don't worry too much about the memory operations. If you've got a linked list or something then you're in trouble, but it should be able to keep up without too much problem.
5x5 is a funny size, but you could do at least 4 flops in one SSE instruction and try to cut your arithmetic overheads. You don't need the Intel compiler, but it might be better, I've heard legends about how it's much better with arithmetic code. Visual Studio has intrinsics for dealing with SSE2, and I think up to SSE4 depending on what you need. Of course, you'd have to roll it yourself. Grabbing a library might be the smart move here.

Trilinos sparse block matrix anomalous memory consumption

I am building an application based on distributed linear algebra using Trilinos, the main issue is that memory consumption is much higher than expected.
I have built a simple test case for building an Epetra::VbrMatrix with 15 million doubles grouped as 5 million blocks of 3 doubles, which should be about 115MB.
After building the matrix on 2 processors, with half the data on each, I get a memory consumption of 500MB on each processor, which is about 7.5 times the data. That looks unreasonable to me; the matrix should just need some integer arrays for locating the nonzero blocks.
I asked on the trilinos-users mailing list; they say memory usage looks too high, but I hope to get some more help here.
I tested both on my laptop with Ubuntu + gcc 4.4.5 + Trilinos 10.0 and on a cluster with PGI compiler and Trilinos 10.4.0, the result is about the same.
My test code is in a gist at https://gist.github.com/848310, where I also recorded memory consumption at different stages of my testing with 2 MPI processes on my laptop.
If anybody has any suggestion, that would be really helpful. Also, if you could even just build, run and report memory consumption, that would be great.
Answer by Alan Williams from the trilinos-users list; in short, VbrMatrix is not suitable for such small blocks, as the storage overhead is bigger than the data itself:
The VbrMatrix storage format definitely incurs some storage overhead, as compared to the simple number of double-precision values being stored.
In your program, you are storing 5,000,000 X 1 X 3 == 15 million doubles. With 8 bytes per double, that is 120 million bytes.
The matrix class Epetra_VbrMatrix (which is the base class for Epetra_FEVbrMatrix) internally stores an Epetra_CrsGraph object, which represents the sparsity structure. This requires a couple of integers per block-row, and 1 integer per block-nonzero. (Your case has 5 million block-rows with 1 block-nonzero per row, so at least 15 million integers in total.)
Additionally, the Epetra_VbrMatrix class stores an Epetra_SerialDenseMatrix object for each block-nonzero. This adds a couple of integers, plus a bool and a pointer, for each block-nonzero. In your case, since your block-nonzeros are small (1x3 doubles), this is a substantial overhead. The VbrMatrix format has proportionally less overhead the bigger your block-nonzeros are. But in your case, with 1x3 blocks, the VbrMatrix is indeed occupying several times more memory than is required for the 15 million doubles.
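To put rough numbers on the overheads listed there (a sketch only; it assumes 4-byte ints and 8-byte pointers, and ignores per-allocation and per-object overhead such as vtables and padding, which push the real figure considerably higher):

#include <cstdio>

int main() {
    // Rough tally, following the mailing-list breakdown above.
    double blocks   = 5e6;                            // block-nonzeros (== block-rows here)
    double data     = blocks * 3 * 8;                 // the 15 million doubles themselves
    double graph    = blocks * 2 * 4 + blocks * 4;    // ~2 ints per block-row + 1 per block-nonzero
    double perBlock = blocks * (2 * 4 + 1 + 8);       // ~2 ints + bool + pointer per SerialDenseMatrix
    std::printf("data %.0f MB, graph %.0f MB, per-block objects %.0f MB (lower bound)\n",
                data / 1e6, graph / 1e6, perBlock / 1e6);
}

Even this lower bound more than doubles the raw data size, which is consistent with the advice that the VbrMatrix format only pays off with larger blocks.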