Armadillo: Matrix multiplications taking up huge amounts of memory - c++

I'm trying to use armadillo to do linear regression as in the following function:
void compute_weights()
{
    printf("transpose\n");
    const mat &xt(X.t());
    printf("inverse\n");
    mat xd;
    printf("mul\n");
    xd = (xt * X);
    printf("inv\n");
    xd = xd.i();
    printf("mul2\n");
    xd = xd * xt;
    printf("mul3\n");
    W = xd * Y;
}
I've split this up so I could see what was going on with the program getting so huge. The matrix X has 64 columns and over 23 million rows. The transpose isn't too bad, but that first multiply causes the memory footprint to completely blow up. Now, as I understand it, if I multiply X.t() * X, each element of the matrix product will be the dot product of a column of X and a row of X.t(), and the result should be a 64x64 matrix.
Sure, it should take a long time, but why would the memory suddenly blow up to nearly 30 gigabytes?
Then it seems to hang on to that memory, and then when it gets to the second multiply, it's just too much, and the OS kills it for getting so huge.
Is there a way to compute products without so much memory usage? Can that memory be reclaimed? Is there a better way to represent these calculations?

You don't stand a chance of doing this whole multiplication in one shot unless you use a huge workstation. As hbrerkere said, your initial memory consumption is about 22 GB. So either be ready for that, or find another way.
If you don't have such a workstation, another way is to do the multiplication yourself, and parallelize it. Here's how you do it:
Don't load the whole matrix into memory, but load parts of it.
Load like a million rows of X, and store it somewhere.
Load a million columns of Y
Use std::transform with the binary operator std::multiplies to multiply the parts you loaded (this will utilize your processor's vectorization, and make it fast), and fill in the partial result you calculated.
Load the next part of your matrices, and repeat
This won't be as efficient, but it will work. Another option is to keep using Armadillo but decompose your matrices into smaller blocks whose products yield partial results; a sketch of that approach follows below.
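For what it's worth, here is a minimal sketch of that block-wise idea in Armadillo, assuming (as in the question) that X and Y are arma::mat objects and that the end goal is the regression weights. It accumulates the small matrices X'X (64x64) and X'Y one block of rows at a time, so no temporary anywhere near the size of X is ever materialized. The block size and the function name are made up for illustration.

#include <algorithm>
#include <armadillo>

// Sketch: accumulate X'X and X'Y block by block, then solve the small
// 64x64 system instead of forming an explicit inverse.
arma::mat compute_weights_blocked(const arma::mat& X, const arma::mat& Y)
{
    const arma::uword block = 1000000;                     // rows per block (tunable)
    arma::mat XtX(X.n_cols, X.n_cols, arma::fill::zeros);  // 64 x 64
    arma::mat XtY(X.n_cols, Y.n_cols, arma::fill::zeros);  // 64 x 1

    for (arma::uword r = 0; r < X.n_rows; r += block)
    {
        const arma::uword last = std::min(r + block, X.n_rows) - 1;
        const arma::mat Xb = X.rows(r, last);              // small copy: block x 64
        XtX += Xb.t() * Xb;
        XtY += Xb.t() * Y.rows(r, last);
    }
    return arma::solve(XtX, XtY);                          // W
}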
Both methods are much slower than the full multiplication for 2 reasons:
The overhead of loading and deleting data from memory
Matrix multiplication is already an O(N^3) problem; splitting it into blocks does not reduce that amount of work, it only adds block-management overhead on top of it.
Good luck!

You can compute the weights using far less memory by using a QR decomposition (you might want to look up 'least squares QR').
Briefly:
Use Householder transformations to (implicitly) find an orthogonal Q so that
Q'*X = R, where R is upper triangular,
and at the same time transform Y:
Q'*Y = y
Then solve
R*W = y for W, using only the top 64 rows of R and y.
If you are willing to overwrite X and Y, then this requires no extra memory; otherwise you will need a copy of X and a copy of Y.
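If writing the Householder sweep by hand is more than you want, a hedged shortcut: Armadillo's solve() accepts a rectangular X and computes the least-squares solution through LAPACK, which avoids forming X.t()*X and the explicit inverse. Whether its memory behaviour is acceptable on a 23-million-row matrix is something you would have to verify; treat this as a sketch, not a guarantee.

#include <armadillo>

// Sketch only: least-squares solution of X*W = Y via the library.
arma::mat compute_weights_ls(const arma::mat& X, const arma::mat& Y)
{
    return arma::solve(X, Y);
}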


Cache-Friendly Matrix, for access to adjacent cells

Current Design
In my program I have a big 2-D grid (1000 x 1000, or more), and each cell contains a small piece of information.
In order to represent this concept the choice is quite trivial: a matrix data structure.
The correspondent code (in C++) is something like:
int w_size_grid = 1000;
int h_size_grid = 1000;
int* matrix = new int[w_size_grid * h_size_grid];
As you can see, I've used a flat one-dimensional array, but the principle is the same.
In order to access an element of the grid, we need a function that, given a cell (x,y), returns the value stored in that cell.
Mathematically, the accessor is a function
f: Z^2 -> Z
where Z is the set of integers.
That can be trivially achieved with a linear function. Here is a C++ representation:
int get_value(int x, int y) {
    return matrix[y * w_size_grid + x];
}
Additional Implementation Notes
Actually, the design requires a sort of "circular-continuous grid": the access indices for a cell can go outside the limits of the grid itself.
That means, for example, the particular case: get_value(-1, -1); is still valid. The function will just return the same value as get_value(w_size_grid - 1, h_size_grid -1);.
This is not a problem in the implementation:
int get_value(int x, int y) {
    adjust_xy(&x, &y); // modify x and y in accordance with that rule
    return matrix[y * w_size_grid + x];
}
Anyway, this is just an additional note to make the scenario clearer; a minimal sketch of the wrap-around helper is given below.
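For completeness, here is what adjust_xy could look like; the name comes from the snippet above, but the body is my assumption, using the grid dimensions declared earlier as globals.

void adjust_xy(int* x, int* y) {
    // Wrap indices into [0, size), so that e.g. (-1, -1) maps to
    // (w_size_grid - 1, h_size_grid - 1), matching the rule described above.
    *x = ((*x % w_size_grid) + w_size_grid) % w_size_grid;
    *y = ((*y % h_size_grid) + h_size_grid) % h_size_grid;
}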
What is the problem?
The problem presented above is very trivial and simple to design and to implement.
My problem is that the matrix is updated at a high frequency: each cell is read and possibly updated with a new value.
Obviously the traversal of the matrix is done with two loops, following a cache-friendly design:
for (int y = 0; y < h_size_grid; ++y) {
    for (int x = 0; x < w_size_grid; ++x) {
        int value = get_value(x, y);
    }
}
The inner loop is over x since [x-1], [x], [x+1] are stored contiguously, so that loop exploits the principle of locality.
The problem arises because the updated value of a cell depends on the values in the adjacent cells.
Each cell has exactly eight neighbours, which are the cells that are horizontally, vertically, or diagonally adjacent.
(-1,-1) | (0,-1) | (1,-1)
-------------------------
(-1, 0) | (0, 0) | (1, 0)
-------------------------
(-1, 1) | (0, 1) | (1, 1)
So the code is intuitively:
for (int y = 0; y < h_size_grid; ++y) {
    for (int x = 0; x < w_size_grid; ++x) {
        int value = get_value(x, y);
        auto values = get_value_all_neighbours(x, y); // values are 8 integers
    }
}
The function get_value_all_neighbours(x,y) accesses the row above and the row below the current y (a sketch of such a function is given below). Since a row of the matrix is quite large, these accesses cause cache misses and pollute the cache.
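A hypothetical sketch of that neighbour fetch, just to make the access pattern explicit (only the name comes from the snippet above, the body is assumed); note the reads at y-1 and y+1, which are the rows that miss the cache:

#include <array>

std::array<int, 8> get_value_all_neighbours(int x, int y) {
    std::array<int, 8> values;
    int i = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            if (dx != 0 || dy != 0)
                values[i++] = get_value(x + dx, y + dy); // wraps via adjust_xy
    return values;
}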
The Question
Now that I have presented the scenario and the problem, my question is how to "solve" it.
Is there a way, using some additional data structure or by reorganizing the data, to exploit the cache and avoid all those misses?
Some Personal Consideration
My intuition points me toward a different data structure.
I've thought about changing the order in which the values are stored in the array, trying to store neighbouring cells at contiguous indices.
That implies a non-linear indexing function for get_value.
After some thought, I believe it is not possible to find such a non-linear function.
I've also thought about an additional data structure like a hash table to store the adjacent values of each cell, but I think that is overkill in space and probably in CPU cycles as well.
Let's assume you do indeed have a problem with cache misses that can't easily be avoided (see the other answers here).
You could use a space-filling curve to organize your data in a cache-friendly way. Essentially, space-filling curves map a volume or plane (such as your matrix) to a linear representation, such that values that are close together in space end up (mostly) close together in the linear representation. In effect, if you store the matrix in a z-ordered array, neighbouring elements have a high likelihood of being on the same memory page.
The best proximity mapping is available with the Hilbert Curve, but it is expensive to calculate. A better option may be a z-curve (Morton-Order). It provides good proximity and is fast to calculate.
Z-Curve: Essentially, to get the ordering, you have to interleave the bits of your x and y coordinate into a single value, called 'z-value'. This z-value determines the position in your list of matrix values (or even simply the index in an array if you use an array). The z-values are consecutive for a completely filled matrix (every cell is used). Conversely, you can de-interleave the position in the list (=array index) and get your x/y coordinates back.
Interleaving the bits of two values is quite fast; there are even dedicated CPU instructions to do this in a few cycles. If you can't find them (I can't, at the moment), you can simply use some bit-twiddling tricks, as sketched below.
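As an illustration (a sketch, not a tuned implementation), the classic bit-twiddling interleave for 16-bit coordinates looks like this; on x86 CPUs with BMI2 the same spreading can be done with the PDEP instruction (_pdep_u32):

#include <cstdint>

// Spread the low 16 bits of v into the even bit positions: ...dcba -> ...0d0c0b0a
static uint32_t spread_bits(uint32_t v) {
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// z-value (Morton order) of a cell: the index into the z-ordered array.
uint32_t z_value(uint32_t x, uint32_t y) {
    return spread_bits(x) | (spread_bits(y) << 1);
}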
Actually, the data structure is not trivial, especially where optimization is concerned.
There are two main issues to resolve: data content and data usage. Data content is the set of values in the data; data usage is how the data is stored and retrieved, and how often.
Data Content
Are all the values accessed? Frequently?
Data that is not accessed frequently can be pushed to slower media, including files. Leave the fast memory (such as data caches) for the frequently accessed data.
Is the data similar? Are there patterns?
There are alternative methods for representing matrices where a lot of the data is the same (such as a sparse matrix or a lower triangular matrix). For large matrices, performing some checks and returning constant values may be faster or more efficient.
Data Usage
Data usage is a key factor in determining an efficient structure for the data, even with matrices.
For example, for frequently accessed data, a map or associative array may be faster.
Sometimes, using many local variables (i.e. registers) may be more efficient for processing matrix data. For example, load up registers with values first (data fetches), operate using the registers, then store the registers back into memory. For most processors, registers are the fastest media for holding data.
The data may need to be rearranged to make efficient use of data caches and cache lines. The data cache is a high-speed area of memory very close to the processor core. A cache line is one row of data in the data cache. An efficient matrix layout fits one or more rows per cache line.
The most efficient method is to perform as many accesses to a cache line as possible before it is evicted, and to reduce the need to reload the data cache (for example because an index was out of range).
Can the operations be performed independently?
For example, scaling a matrix, where each location is multiplied by a value: these operations don't depend on other cells of the matrix, so they can be performed in parallel and delegated to processors with many cores (such as GPUs). A minimal sketch follows.
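A minimal sketch of that idea, assuming the flat int array used earlier in the question; the OpenMP pragma is just one common way to express the parallelism (compile with OpenMP enabled), and a GPU kernel would follow the same per-element pattern.

// Each element is scaled independently, so iterations can run on separate cores.
void scale_matrix(int* matrix, int n_elems, int factor) {
    #pragma omp parallel for
    for (int i = 0; i < n_elems; ++i)
        matrix[i] *= factor;
}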
Summary
When a program is data driven, the choice of data structures is not trivial. The content and usage are important factors when choosing a structure for the data and how the data is aligned. Usage and performance requirements will also determine the best algorithms for accessing the data. There are already many articles on the internet for optimizing for data driven applications and best usage of data caches.

C++ - Performance of static arrays, with variable size at launch

I wrote a cellular automaton program that stores data in a matrix (an array of arrays). For a 300*200 matrix I can achieve 60 or more iterations per second using static memory allocation (e.g. std::array).
I would like to produce matrices of different sizes without recompiling the program every time, i.e. the user enters a size and then the simulation for that matrix size begins. However, if I use dynamic memory allocation (e.g. std::vector), the simulation drops to ~2 iterations per second. How can I solve this problem? One option I've resorted to is to pre-allocate a static array larger than what I anticipate the user will select (e.g. 2000*2000), but this seems wasteful and still limits user choice to some degree.
I'm wondering if I can either
a) allocate memory once and then somehow "freeze" it for ordinary static array performance?
b) or perform more efficient operations on the std::vector? For reference, I am only performing matrix[x][y] == 1 and matrix[x][y] = 1 operations on the matrix.
According to this question/answer, there is no difference in performance between std::vector and using raw pointers.
EDIT:
I've rewritten the matrix, as per UmNyobe's suggestion, to be a single array, accessed via matrix[y*size_x + x]. Using dynamic memory allocation (sized once at launch), I doubled the performance to 5 iterations per second.
As per PaulMcKenzie's comment, I compiled a release build and got the performance I was looking for (60 or more iterations per second). However, this is the foundation for more, so I still wanted to quantify the benefit of one method over the other more thoroughly. I used std::chrono::high_resolution_clock to time each iteration, and found the performance difference between dynamic and static arrays (after switching to the single-array matrix representation) to be within the margin of error (450-600 microseconds per iteration).
The performance during debugging is a slight concern however, so I think I'll keep both, and switch to a static array when debugging.
For reference, I am only performing
matrix[x][y]
Red flag! Are you using vector<vector<int>> for your matrix representation? This is a mistake, as the rows of your matrix will be far apart in memory. You should use a single vector of size rows x cols and index it as matrix[y * cols + x].
Furthermore, you should index first by row and then by column, i.e. matrix[y][x] rather than matrix[x][y], and your algorithm should traverse the data the same way. This is because with matrix[y][x] the cells (x, y) and (x + 1, y) are adjacent in memory, while with the other layout (x, y) and (x + 1, y) are much farther apart (a small sketch of such a flat, row-major grid follows).
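A small sketch of that layout (the class and member names are mine, not from the question):

#include <vector>

// One contiguous buffer, indexed row-major, so (x, y) and (x + 1, y) are adjacent.
struct Grid {
    int rows, cols;
    std::vector<int> data;
    Grid(int r, int c) : rows(r), cols(c), data(r * c, 0) {}
    int&       at(int x, int y)       { return data[y * cols + x]; }
    const int& at(int x, int y) const { return data[y * cols + x]; }
};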
Even if there is a performance decrease from std::array to std::vector (since the array can have its elements on the stack, which is faster), a decent algorithm will perform within the same order of magnitude with both containers.

How to Optimize Memory Hits in Matrix Computations by Storing Contiguous Blocks by Row or by Column

I am looking to optimize memory hits through how I store a two-dimensional matrix in memory. I was going to fold the 2D matrix into a 1D contiguous block and was wondering whether it would make more sense to store the data as consecutive blocks by row or by column. The type of operations I am considering are the more expensive ones such as multiplication and SVD. Note that I am considering an implementation in C++.
Clarification on Configuration
By consecutive rows or columns I mean the following. Consider a 3x3 matrix
[a11 a12 a13]
[a21 a22 a23]
[a31 a32 a33]
Would it make more sense to store the matrix by row
[[a11 a12 a13] [a21 a22 a23] [a31 a32 a33]]
and then each element at [i,j] would be accessed as [i*nCol + j] and any element a[i,j] is closer in memory to a[i,j+1] than to a[i+1,j]
Or by column?
[[a11 a21 a31] [a12 a22 a32] [a13 a23 a33]]
and then each element at [i,j] would be accessed as [j*nRow + i] and any element a[i,j] is closer in memory to a[i+1,j] than a[i,j+1]
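As a small illustration, the two layouts correspond to these index functions (0-based, sketch only):

// Row-major: a[i,j] at i*nCol + j, so a[i,j] and a[i,j+1] are adjacent.
inline int idx_row_major(int i, int j, int nCol) { return i * nCol + j; }

// Column-major: a[i,j] at j*nRow + i, so a[i,j] and a[i+1,j] are adjacent.
inline int idx_col_major(int i, int j, int nRow) { return j * nRow + i; }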
Now say we had a cache that only loaded blocks of three doubles at once. In the first case, accessing a11, a12, and a13 requires loading one block. In the second case, accessing a11, a12, and a13 requires loading three blocks. This may not be an issue for a 3x3 matrix, where both cases need to load three blocks to complete the computation and all three easily fit in the cache at once, but it might become an issue with very large matrices, where you cannot fit the entire matrix in your cache at one time.
Intuitive Response
I have done some research on storing a 2D matrix as a 1D array such as:
1d-or-2d-array-whats-faster
And also about the operators involved in matrix multiplication such as
Wikipedia article on Matrix Multiplication
and the associated Strassen Algorithm.
It seems that due to the nature of matrix multiplication, you traverse one matrix by row and the other by column. Intuitively, I would think that what performance you gain from storing the data in one configuration over the other is lost in this specific operation.
i.e. consider multiplying two matrices C = AB, where A is NxM and B is MxL:
c[i,k] = sum(a[i,m] * b[m,k]) for m = [1...M]
You access the left matrix by rows and the right matrix by columns, so there is no clear advantage to either layout: what is better for one matrix is worse for the other (see the sketch below).
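To make the access pattern concrete, here is the naive triple loop for C = AB with both matrices stored row-major (a sketch, not an optimized kernel): A is read along a row while B is read down a column, so whichever layout helps one operand hurts the other.

// C (NxL) = A (NxM) * B (MxL), all stored row-major.
void matmul(const double* A, const double* B, double* C, int N, int M, int L) {
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < L; ++k) {
            double sum = 0.0;
            for (int m = 0; m < M; ++m)
                sum += A[i * M + m] * B[m * L + k]; // row of A, column of B
            C[i * L + k] = sum;
        }
}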
Considering the computationally expensive operations on matrices, would one of these configurations be better in terms of memory access? Or is this a non-issue, given that practical large-scale matrix multiplication is done on GPUs or similar hardware? Or is the cost of loading blocks of memory overshadowed by something else?
The standard way of modelling a non-sparse matrix is to use a contiguous memory block.
Where you deviate, though, is in your attempt to build a matrix class of your own from scratch. I recommend you use an established library such as BLAS (you can get this as a Boost package). It is unlikely that you'll beat the optimisations made in such a library unless you have a great deal of free time.
Matrix multiplication itself is, as you correctly point out, such that arranging the contiguous memory will favour either the left or the right matrix. Determinant evaluation is similar. Really, defer such considerations to a third-party, well-tested library.

How to optimize this product of three matrices in C++ for x86?

I have a key algorithm in which most of its runtime is spent on calculating a dense matrix product:
A*A'*Y, where: A is an m-by-n matrix,
A' is its conjugate transpose,
Y is an m-by-k matrix
Typical characteristics:
- k is much smaller than both m or n (k is typically < 10)
- m in the range [500, 2000]
- n in the range [100, 1000]
Based on these dimensions, according to the lessons of the matrix chain multiplication problem, it's clear that it's optimal in a number-of-operations sense to structure the computation as A*(A'*Y). My current implementation does that, and the performance boost from just forcing that associativity to the expression is noticeable.
My application is written in C++ for the x86_64 platform. I'm using the Eigen linear algebra library, with Intel's Math Kernel Library as a backend. Eigen is able to use IMKL's BLAS interface to perform the multiplication, and the boost from moving from Eigen's native SSE2 implementation to Intel's optimized, AVX-based implementation on my Sandy Bridge machine is also significant.
However, the expression A * (A.adjoint() * Y) (written in Eigen parlance) gets decomposed into two general matrix-matrix products (calls to the xGEMM BLAS routine), with a temporary matrix created in between. I'm wondering if, by going to a specialized implementation for evaluating the entire expression at once, I can arrive at an implementation that is faster than the generic one that I have now. A couple observations that lead me to believe this are:
Using the typical dimensions described above, the input matrix A usually won't fit in cache. Therefore, the specific memory access pattern used to calculate the three-matrix product would be key. Obviously, avoiding the creation of a temporary matrix for the partial product would also be advantageous.
A and its conjugate transpose obviously have a very related structure that could possibly be exploited to improve the memory access pattern for the overall expression.
Are there any standard techniques for implementing this sort of expression in a cache-friendly way? Most optimization techniques that I've found for matrix multiplication are for the standard A*B case, not larger expressions. I'm comfortable with the micro-optimization aspects of the problem, such as translating into the appropriate SIMD instruction sets, but I'm looking for any references out there for breaking this structure down in the most memory-friendly manner possible.
Edit: Based on the responses that have come in thus far, I think I was a bit unclear above. The fact that I'm using C++/Eigen is really just an implementation detail from my perspective on this problem. Eigen does a great job of implementing expression templates, but evaluating this type of problem as a simple expression just isn't supported (only products of 2 general dense matrices are).
At a higher level than how the expressions would be evaluated by a compiler, I'm looking for a more efficient mathematical breakdown of the composite multiplication operation, with a bent toward avoiding unneeded redundant memory accesses due to the common structure of A and its conjugate transpose. The result would likely be difficult to implement efficiently in pure Eigen, so I would likely just implement it in a specialized routine with SIMD intrinsics.
This is not a full answer (yet - and I'm not sure it will become one).
Let's think about the math first a little. Since matrix multiplication is associative, we can do either
(A*A')*Y or A*(A'*Y).
Floating point operations for (A*A')*Y
2*m*n*m + 2*m*m*k //the twos come from addition and multiplication
Floating point operations for A*(A'*Y)
2*m*n*k + 2*m*n*k = 4*m*n*k
Since k is much smaller than m and n it's clear why the second case is much faster.
But by symmetry we could in principle reduce the number of calculations for A*A' by a factor of two (though this might not be easy to do with SIMD), so we could reduce the number of floating point operations of (A*A')*Y to
m*n*m + 2*m*m*k.
We know that both m and n are larger than k. Let's replace both m and n with a new variable z and find out where cases one and two are equal:
z*z*z + 2*z*z*k = 4*z*z*k //now simplify
z = 2*k.
So as long as m and n are both more than twice k, the second case will have fewer floating point operations. In your case m and n are both more than 100 and k is less than 10, so case two uses far fewer floating point operations.
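For a concrete (made-up) example within the stated ranges, take m = 1000, n = 500, k = 10:
(A*A')*Y: 2*1000*500*1000 + 2*1000*1000*10 = 1.02e9 operations
A*(A'*Y): 4*1000*500*10 = 2.0e7 operations
i.e. roughly 50 times fewer operations for the second grouping.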
In terms of efficient code: if the code is optimized for efficient use of the cache (as MKL and Eigen are), then large dense matrix multiplication is compute bound, not memory bound, so you don't have to worry about the cache. MKL is faster than Eigen since MKL uses AVX (and maybe FMA3 now?).
I don't think you will be able to do this more efficiently than you already are, using the second case and MKL (through Eigen). Enable OpenMP to get maximum FLOPS.
You should calculate the efficiency by comparing your FLOPS to the peak FLOPS of your processor. Assuming you have a Sandy Bridge/Ivy Bridge processor, the peak SP FLOPS is
frequency * number of physical cores * 8 (8-wide AVX SP) * 2 (addition + multiplication)
For double precision, divide by two. If you have Haswell and MKL uses FMA, then double the peak FLOPS. To get the frequency right you have to use the turbo boost value for all cores (it's lower than for a single core). You can look this up if you have not overclocked your system, or use CPU-Z on Windows or Powertop on Linux if you have.
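For a concrete (hypothetical) example: a 4-core Sandy Bridge with a 3.0 GHz all-core turbo gives 3.0e9 * 4 * 8 * 2 = 192 SP GFLOPS, or 96 DP GFLOPS.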
Use a temporary matrix to compute A'*Y, but make sure you tell Eigen that there's no aliasing going on: temp.noalias() = A.adjoint()*Y. Then compute your result, once again telling Eigen that objects aren't aliased: result.noalias() = A*temp.
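Spelled out as a sketch (complex-valued dynamic matrices are assumed here, since A' is a conjugate transpose; the function name is made up):

#include <Eigen/Dense>

Eigen::MatrixXcd AAtY(const Eigen::MatrixXcd& A, const Eigen::MatrixXcd& Y)
{
    Eigen::MatrixXcd temp(A.cols(), Y.cols());
    temp.noalias() = A.adjoint() * Y;            // first xGEMM: n x k temporary
    Eigen::MatrixXcd result(A.rows(), Y.cols());
    result.noalias() = A * temp;                 // second xGEMM: m x k result
    return result;
}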
There would be redundant computation only if you performed (A*A')*Y, since in that case (A*A') is symmetric and only half of the computation is required. However, as you noticed, it is still much faster to perform A*(A'*Y), in which case there are no redundant computations. I can confirm that the cost of creating the temporary is completely negligible here.
I guess that performing the following
result = A * (A.adjoint() * Y)
will be the same as doing this:
temp = A.adjoint() * Y
result = A * temp;
If your matrix Y fits in the cache, you can probably take advantage of doing it like this:
result = A * (Y.adjoint() * A).adjoint()
or, if the previous notation is not allowed, like this:
temp = Y.adjoint() * A
result = A * temp.adjoint();
Then you don't need to take the adjoint of matrix A or store a temporary adjoint matrix for A, which would be much more expensive than the one for Y.
If your matrix Y fits in the cache, it should be much faster to loop over the columns of A for the first multiplication and then over the rows of A for the second (keeping Y.adjoint() in the cache for the first multiplication and temp.adjoint() for the second), but I guess that internally Eigen is already taking care of these things.

inverse fourier transform FFT3W

I am using a C++ function to compute an inverse Fourier transform.
int inYSize = 170; int inXSize = 2280;
float* outData = new float[inYSize * inXSize];
fftwf_plan mReverse = fftwf_plan_dft_c2r_2d(inYSize, inXSize,
                                            (fftwf_complex*)temp, outData,
                                            FFTW_ESTIMATE);
fftwf_execute(mReverse);
My input is a 2D array temp of complex numbers. All the elements have real value 1 and imaginary value 0.
So I am expecting the inverse FFT of such an array to be a 2D array of real values, with a spike at (0,0) and all other values 0. But I am getting completely different values in the output array, even after normalizing by the total size of the array. What could be the reason?
FFTW is not that trivial to deal with when it comes to multidimensional DFTs and complex-to-real transforms.
When doing a C2R transform of an MxN row-major array, the second dimension is cut in half because of the symmetry of the result: outData is twice as big as needed, but that's not the reason for your problem (and not your case, as you are doing C2R and not R2C).
More info about this tortuous matter: http://www.fftw.org/doc/One_002dDimensional-DFTs-of-Real-Data.html
"Good Guy Advice" : Use only the C2C "easier" way of doing things, take the modulus of the output if you don't know how to process the results, but don't waste your time on n-D Complex to Real transforms.
Because of limited precision, because of the numerical implementation of the DFT, because of unsubordinated drunk bits, you can get values that are not 0 even if they are very small. This is the normal behavior of a FFT algorithm.
Besides reading the user manual carefully (http://www.fftw.org/doc/), even if it's a real pain (I lost a few days with this library just to get a 3D transform working and to understand how the data was scaled), you should try a C2C 1D transform before going to C2C 2D and C2R 2D, just to be sure you have some idea of what you're doing; a minimal example is sketched below.
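A minimal 1-D C2C example of that kind (a sketch; the size and data are made up). Keep in mind that FFTW's backward transform is unnormalized, so divide by n if you want unit scaling:

#include <fftw3.h>

int main()
{
    const int n = 8;
    fftwf_complex* in  = (fftwf_complex*) fftwf_malloc(sizeof(fftwf_complex) * n);
    fftwf_complex* out = (fftwf_complex*) fftwf_malloc(sizeof(fftwf_complex) * n);

    for (int i = 0; i < n; ++i) { in[i][0] = 1.0f; in[i][1] = 0.0f; } // all-ones input

    fftwf_plan p = fftwf_plan_dft_1d(n, in, out, FFTW_BACKWARD, FFTW_ESTIMATE);
    fftwf_execute(p);
    // out[] now holds the unnormalized backward DFT of the all-ones input.

    fftwf_destroy_plan(p);
    fftwf_free(in);
    fftwf_free(out);
    return 0;
}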
What's the inverse FFT of a planar constant, where every bin of the "frequency plane" is filled with a one? Are you looking for a new way to define +inf or -inf? In that case I would rather start with the easier division by 0 ^^. The direct FFT should be as you described, with the spike correctly scaled to 1; I'm pretty sure the inverse is not.
Do not hesitate to add more detail to your question, and good luck with FFTW.
With this little information it is hard to tell. What I could imagine is that you are getting spectral leakage due to the window selection (see this Wikipedia article for details about leakage).
What you could do is try using another windowing function to reduce leakage or redefine your windowing size.