Set sparsity pattern of Eigen::SparseMatrix without memory overhead - c++

I need to set the sparsity pattern of an Eigen::SparseMatrix that I already know (I have unique, sorted column indices and row offsets). Obviously this is possible via setFromTriplets, but unfortunately setFromTriplets requires a lot of additional memory (at least in my case).
I wrote a small example:
#include <iostream>
#include <vector>
#include <Eigen/Sparse>

int main() {
    const long nRows = 5000000;
    const long nCols = 100000;
    const long nCols2Skip = 1000;
    //It's quite big!
    const long nTriplets2Reserve = nRows * (nCols / nCols2Skip) * 1.1;

    Eigen::SparseMatrix<double, Eigen::RowMajor, long> mat(nRows, nCols);
    std::vector<Eigen::Triplet<double, long>> triplets;
    triplets.reserve(nTriplets2Reserve);
    for(long row = 0; row < nRows; ++row){
        for(long col = 0; col < nCols; col += nCols2Skip){
            triplets.push_back(Eigen::Triplet<double, long>(row, col, 1));
        }
    }

    std::cout << "filling mat" << std::endl << std::flush;
    mat.setFromTriplets(triplets.begin(), triplets.end());
    std::cout << "Finished! nnz " << mat.nonZeros() << std::endl;

    //Stupid way to check memory consumption
    std::cin.get();
}
In my case this example consumes about 26 GB at peak (between the "filling mat" and "Finished" lines) and 18 GB after that (I checked everything via htop). An ~8 GB overhead is quite big for me (in my "real world" task I have an even bigger overhead).
So I have two questions:
1. How can I fill the sparsity pattern of an Eigen::SparseMatrix with as little overhead as possible?
2. Why does setFromTriplets require so much memory?
Please let me know if my example is wrong.
My Eigen version is 3.3.2
PS Sorry for my English
EDIT:
It looks like inserting each element manually (with preallocation) is faster and has a smaller peak memory footprint, as sketched below. But I still want to know whether it is possible to set the sparsity pattern directly.
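For reference, here is a sketch of what that manual insertion can look like (I'm assuming per-row reservation via reserve followed by insert, as in the Eigen sparse tutorial; the edit doesn't say which exact variant was used):
Eigen::SparseMatrix<double, Eigen::RowMajor, long> mat(nRows, nCols);
// reserve the expected number of nonzeros per row (nCols / nCols2Skip in this example)
mat.reserve(Eigen::VectorXi::Constant(nRows, nCols / nCols2Skip));
for(long row = 0; row < nRows; ++row)
    for(long col = 0; col < nCols; col += nCols2Skip)
        mat.insert(row, col) = 1.0;   // inserting in increasing (row, col) order keeps this cheap
mat.makeCompressed();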

Ad 1: You can be even a bit more efficient than plain insert by using the internal functions startVec and insertBack, if you can guarantee that you insert elements in lexicographical order.
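A sketch of that low-level fill for the example above (assuming Eigen 3.3; startVec, insertBack and finalize are internal helpers, so verify them against your Eigen version):
Eigen::SparseMatrix<double, Eigen::RowMajor, long> mat(nRows, nCols);
mat.reserve(nTriplets2Reserve);             // total number of nonzeros
for(long row = 0; row < nRows; ++row) {
    mat.startVec(row);                      // start a new outer vector (a row, since RowMajor)
    for(long col = 0; col < nCols; col += nCols2Skip)
        mat.insertBack(row, col) = 1.0;     // indices must arrive in lexicographical order
}
mat.finalize();                             // switches the matrix to compressed mode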
Ad 2: If you use setFromTriplets you need approximately twice the final size of the matrix (plus the size of your Triplet container), since the elements are first inserted into a transposed version of the matrix, which is then transposed into the final matrix in order to make sure that all inner vectors are sorted. If you know the structure of your matrix ahead of time this is obviously quite a waste of memory, but it is intended to work on arbitrary input data.
In your example you have 5000000 * 100000 / 1000 = 5e8 elements. A Triplet requires 8+8+8 = 24 bytes (making about 12 GB for the vector) and each element of the sparse matrix requires 8+8 = 16 bytes (one double for the value, one long for the inner index), i.e., about 8 GB per matrix copy. In total (the Triplet vector plus the two matrix copies) that is about 28 GB, which is roughly the 26 GiB you observed.
Bonus:
If your matrix has some special structure which can be stored more efficiently, and you are willing to dig deeper into the Eigen internals, you may also consider implementing a new type inheriting from Eigen::SparseBase<> (but I don't recommend this unless memory/performance is very critical for you and you are willing to go through a lot of "sparsely" documented internal Eigen code ...). However, in that case it is probably easier to think about what you intend to do with your matrix and try to implement only the special operations you need for that.

Related

If I have two functions copyij() and copyji(), to copy a 2048x2048 integer array, which one is faster and why? [duplicate]


Computing size of symmetric difference of two sorted arrays using SIMD AVX

I am looking for a way to optimize an algorithm that I am working on. Its most repetitive and thus compute-intensive part is the comparison of two sorted arrays of any size, containing unique unsigned integer (uint32_t) values, in order to obtain the size of their symmetric difference (the number of elements that exist in only one of the vectors). The target machine on which the algorithm will be deployed uses Intel processors supporting AVX2, therefore I am looking for a way to perform it in-place using SIMD.
Is there a way to exploit the AVX2 instructions to obtain the size of symmetric difference of two sorted arrays of unsigned integers?
Since both arrays are sorted it should be fairly easy to implement this algorithm using SIMD (AVX2). You would just need to iterate through the two arrays concurrently, and when you get a mismatch comparing two vectors of 8 ints you would need to resolve the mismatch, i.e. count the differences and get the two array indices back in phase, then continue until you reach the end of one of the arrays. Then just add the number of remaining elements in the other array, if any, to get the final count.
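For reference, here is a plain scalar version of the count being described (a sketch added for clarity, not from the answer; no AVX2 yet):
#include <cstddef>
#include <cstdint>

// Counts elements present in exactly one of two sorted, duplicate-free arrays.
size_t symmetric_difference_size(const uint32_t* a, size_t na,
                                 const uint32_t* b, size_t nb)
{
    size_t i = 0, j = 0, count = 0;
    while (i < na && j < nb) {
        if (a[i] == b[j])     { ++i; ++j; }       // in both arrays: not counted
        else if (a[i] < b[j]) { ++count; ++i; }   // only in a
        else                  { ++count; ++j; }   // only in b
    }
    return count + (na - i) + (nb - j);           // the remaining tail is all unique
}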
Unless your arrays are tiny (like <= 16 elements), you need to perform a merge of the two sorted arrays with additional code for dumping the non-equal elements.
If the size of the symmetric difference is expected to be very small, then use the method described by PaulR.
If the size is expected to be high (like 10% of the total number of elements), then you will have real trouble vectorizing it. It is much easier to optimize the scalar solution.
After writing several versions of code, the fastest one for me is:
int Merge3(const int *aArr, int aCnt, const int *bArr, int bCnt, int *dst) {
    int i = 0, j = 0, k = 0;
    while (i < aCnt - 32 && j < bCnt - 32) {
        for (int t = 0; t < 32; t++) {
            int aX = aArr[i], bX = bArr[j];
            dst[k] = (aX < bX ? aX : bX);
            k += (aX != bX);
            i += (aX <= bX);
            j += (aX >= bX);
        }
    }
    while (i < aCnt && j < bCnt) {
        ... //use simple code to merge tails
The main optimizations here are:
Perform the merging iterations in blocks (32 iterations per block). This allows simplifying the per-iteration stop criterion from (i < aCnt && j < bCnt) to t < 32. This can be done for most of the elements, and the tails can be processed with slow code (see the sketch after this list).
Perform the iterations in a branchless fashion. Note that the ternary operator is compiled into a cmov instruction and the comparisons are compiled into setXX instructions, so there are no branches in the loop body. The output data is stored with the well-known trick: write every element, but advance the output index only for the valid ones.
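The tail handling is elided in the code above; a plain scalar version of it (my sketch, using the same per-element logic as the blocked loop) could look like:
    while (i < aCnt && j < bCnt) {
        int aX = aArr[i], bX = bArr[j];
        dst[k] = (aX < bX ? aX : bX);
        k += (aX != bX);
        i += (aX <= bX);
        j += (aX >= bX);
    }
    while (i < aCnt) dst[k++] = aArr[i++];   // whatever remains exists in only one array
    while (j < bCnt) dst[k++] = bArr[j++];
    return k;                                // k is the size of the symmetric difference
}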
What else I have tried:
(no vectorization) perform (4 + 4) bitonic merge, check consecutive elements for duplicates, move pointers so that 4 min elements (in total) are skipped:
gets 4.95ns vs 4.65ns --- slightly worse.
(fully vectorized) compare 4 x 4 elements pairwise, extract comparison results into 16-bit mask, pass it through perfect hash function, use _mm256_permutevar8x32_epi32 with 128-entry LUT to get sorted 8 elements, check consecutive elements for duplicates, use _mm_movemask_ps + 16-entry LUT + _mm_shuffle_epi8 to store only unique elements among minimal 4 elements: gets 4.00ns vs 4.65ns --- slightly better.
Here is the file with solutions and file with perfect hash + LUT generator.
P.S. Note that a similar problem for the intersection of sets is solved here. The solution is somewhat similar to what I outlined as point 2 above.

Why is iterating 2D array row major faster than column major?

Here is simple C++ code that compares iterating over a 2D array in row-major versus column-major order.
#include <iostream>
#include <ctime>
using namespace std;

const int d = 10000;
int** A = new int* [d];

int main(int argc, const char * argv[]) {
    for(int i = 0; i < d; ++i)
        A[i] = new int [d];

    clock_t ColMajor = clock();
    for(int b = 0; b < d; ++b)
        for(int a = 0; a < d; ++a)
            A[a][b]++;
    double col = static_cast<double>(clock() - ColMajor) / CLOCKS_PER_SEC;

    clock_t RowMajor = clock();
    for(int a = 0; a < d; ++a)
        for(int b = 0; b < d; ++b)
            A[a][b]++;
    double row = static_cast<double>(clock() - RowMajor) / CLOCKS_PER_SEC;

    cout << "Row Major : " << row;
    cout << "\nColumn Major : " << col;
    return 0;
}
Result for different values of d:
d = 10^3 :
Row Major : 0.002431
Column Major : 0.017186
d = 10^4 :
Row Major : 0.237995
Column Major : 2.04471
d = 10^5
Row Major : 53.9561
Column Major : 444.339
Now the question is: why is row major faster than column major?
It obviously depends on the machine you're on but very generally speaking:
Your computer stores parts of your program's memory in a cache that has a much smaller latency than main memory (even when compensating for cache hit time).
C arrays are stored contiguously in row-major order. This means that if you ask for element x, then element x+1 is stored in main memory at a location directly following where x is stored.
It's typical for your computer cache to "pre-emptively" fill cache with memory addresses that haven't been used yet, but that are locally close to memory that your program has used already. Think of your computer as saying: "well, you wanted memory at address X so I am going to assume that you will shortly want memory at X+1, therefore I will pre-emptively grab that for you and place it in your cache".
When you enumerate your array in row-major order, you're enumerating it in the same contiguous order in which it is stored in memory, and your machine has already taken the liberty of pre-loading those addresses into cache for you because it guessed that you wanted them. Therefore you achieve a higher rate of cache hits. When you enumerate the array in a non-contiguous manner, your machine likely won't predict the memory access pattern you're applying, so it won't be able to pre-emptively pull memory addresses into cache for you; you get fewer cache hits, and main memory has to be accessed more frequently, which is slower than your cache.
Also, this might be better suited for https://cs.stackexchange.com/ because the way your system cache behaves is implemented in hardware, and spatial locality questions seem better suited there.
Your array is actually a ragged array (an array of row pointers), so row-major layout isn't the only factor.
You're seeing better performance when the inner loop runs over the columns of a row, because each row's memory is laid out linearly, reading it sequentially is easy for the cache prefetcher to predict, and you amortize the pointer dereference into the second dimension, since it only needs to be done once per row.
When the inner loop runs over rows instead, you incur a pointer dereference into the second dimension on every iteration. Aside from the intrinsic cost of the extra dereferences, this access pattern is bad for cache prediction.
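In concrete terms (a sketch using the question's array A), the amortized version hoists the row pointer out of the inner loop:
for (int a = 0; a < d; ++a) {
    int* row = A[a];          // one pointer dereference per row...
    for (int b = 0; b < d; ++b)
        row[b]++;             // ...then purely contiguous accesses within the row
}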
If you want a true two-dimensional array, laid out in memory using row-major ordering, you would want...
int A[1000][1000];
This lays out the memory contiguously in row-major order, instead of one array of pointers to separately allocated arrays (which are not laid out contiguously). Iterating over this array in row-major order would still perform faster than iterating in column-major order because of spatial locality and cache prediction.
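For the d = 10000 used in the question a stack array of that size would likely overflow the stack, so as a sketch (not part of the original answer), a heap-backed contiguous layout with manual indexing gives the same row-major behavior:
#include <vector>

int main() {
    const int d = 10000;
    std::vector<int> A(static_cast<size_t>(d) * d);   // one contiguous row-major block

    // element (a, b) lives at flat offset a*d + b
    for (int a = 0; a < d; ++a)
        for (int b = 0; b < d; ++b)
            A[static_cast<size_t>(a) * d + b]++;      // contiguous, prefetcher-friendly
    return 0;
}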
The short answer is CPU caches.
Scott Meyers explains it very clearly here

Use map instead of array in C++ to protect searching outside of array bounds?

I have a gridded rectangular file that I have read into an array. This gridded file contains data values and NODATA values; the data values make up a continuous odd shape inside of the array, with NODATA values filling in the rest to keep the gridded file rectangular. I perform operations on the data values and skip the NODATA values.
The operations I perform on the data values consist of examining the 8 surrounding neighbors (the current cell is the center of a 3x3 grid). I can handle when any of the eight neighbors are NODATA values, but when actual data values fall in the first or last row/column, I trigger an error by trying to access an array value that doesn't exist.
To get around this I have considered three options:
Add a new first and last row/column with NODATA values, and adjust my code accordingly - I can cycle through the internal 'original' array and handle the new NODATA values like the edges I'm already handling that don't fall in the first and last row/column.
I can create specific processes for handling the cells with data that fall in the first and last row/column - modified for loops (loops that step through a specific sequence/range) that only examine the surrounding cells that actually exist. Since I still need 8 neighboring values (NODATA/non-existent cells are given the same value as the central cell), I would have to copy blank/NODATA values into a secondary 3x3 grid, though there may be a way to avoid the secondary grid. This solution is annoying because I have to code up specialized routines for the four corner cells (4 different for loops) and for cells in the first or last row/column (another 4 different for loops), plus a single for loop for any non-edge cell.
Use a map, which based on my reading appears capable of storing the original array while letting me look up locations outside the array without triggering an error. In this case I still have to give these non-existent cells a value (equal to the central cell), and so may or may not have to set up a secondary 3x3 grid as well; once again there may be a way to avoid the secondary grid.
Solution 1 seems the simplest, solution 3 the most clever, and solution 2 the most annoying. Are there any solutions I'm missing? Or does one of these solutions deserve to be the clear winner?
My advice is to replace all read accesses to the array with a function, for example arr[i][j] with getarr(i,j). That way all your algorithmic code stays more or less unchanged, and you can easily return NODATA for indices outside the bounds.
But I must admit that it is only my opinion.
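A minimal sketch of that accessor idea (arr, nRows, nCols and NODATA stand in for the question's own names; adapt to the real types):
double getarr(int i, int j) {
    if (i < 0 || i >= nRows || j < 0 || j >= nCols)
        return NODATA;      // anything outside the grid reads as NODATA
    return arr[i][j];
}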
I've had to do this before and the fastest solution was to expand the region with NODATA values and iterate over the interior. This way the core loop is simple for the compiler to optimize.
If this is not a computational hot-spot in the code, I'd go with Serge's approach instead though.
To minimize rippling effects I used an array structure with explicit row/column strides, something like this:
#include <memory>
#include <vector>
using std::shared_ptr;
using std::vector;

class Grid {
private:
    shared_ptr<vector<double>> data;
    int origin;
    int xStride;
    int yStride;
public:
    Grid(int nx, int ny) :
        data( new vector<double>(nx*ny) ),
        origin(0),
        xStride(1),
        yStride(nx) {
    }
    // padded grid: padx/pady extra cells on every side
    Grid(int nx, int ny, int padx, int pady) :
        data( new vector<double>((nx+2*padx)*(ny+2*pady)) ),
        origin(pady*(nx+2*padx) + padx),   // flat index of logical (0,0)
        xStride(1),
        yStride(nx+2*padx) {
    }
    double& operator()(int x, int y) {
        return (*data)[origin + x*xStride + y*yStride];
    }
};
Now you can do
Grid g(5,5,1,1);
Grid g2(5,5);
// Initialise
for(int i=0; i<5; ++i) {
    for(int j=0; j<5; ++j) {
        g(i,j)=i+j;
    }
}
// Convolve (note we don't care about going outside the
// range, and our indices are unchanged between the two
// grids).
for(int i=0; i<5; ++i) {
    for(int j=0; j<5; ++j) {
        g2(i,j)=0;
        g2(i,j)+=g(i-1,j);
        g2(i,j)+=g(i+1,j);
        g2(i,j)+=g(i,j-1);
        g2(i,j)+=g(i,j+1);
    }
}
Aside: This data structure is awesome for working with transposes, and sub-matrices. Each of those is just an adjustment of the offset and stride values.
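For instance, a transposed view (a sketch added here, assuming it is added as a member of the Grid class above and that <utility> is included for std::swap) is just the same buffer with the strides swapped; nothing is copied:
// Sketch: add inside class Grid
Grid transposed() const {
    Grid t(*this);                    // shares the same underlying data vector
    std::swap(t.xStride, t.yStride);  // element (x,y) of t is element (y,x) of *this
    return t;
}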
Solution 1 is the standard solution. It takes maximum advantage of modern computer architectures, where a few extra bytes of memory are no big deal and predictable access patterns accelerate performance. As you keep accessing memory in a predictable pattern (with fixed strides), the CPU prefetcher will successfully read ahead.
Solution 2 saves a small amount of memory, but the special handling of the edges incurs a real slowdown. Still, the large chunk in the middle benefits from the prefetcher.
Solution 3 is horrible. Map access is O(log N) instead of O(1), and in practice it can be 10-20 times slower. Maps have poor locality of reference; the CPU prefetcher will not kick in.
If simple means "easy to read", I'd recommend you declare a class with an overloaded [] operator. Use it like a regular array, but it will have bounds checking that handles NODATA.
If simple means "high performance" and you have a sparse grid with isolated DATA, consider implementing linked lists to the DATA values and implementing optimized operators that go directly to the DATA values.
Option 1 wastes memory proportional to your overall rectangle size, option 3 (maps) is clumsy here, and option 2 is actually very easy to do:
T d[X][Y] = ...;
for (int x = 0; x < X; ++x)
    for (int y = 0; y < Y; ++y) // move over d[x][y] centres
    {
        // start with all 9 cells holding the centre value...
        T r[3][3] = { { d[x][y], d[x][y], d[x][y] },
                      { d[x][y], d[x][y], d[x][y] },
                      { d[x][y], d[x][y], d[x][y] } };
        // ...then copy in whichever neighbours actually exist and hold data
        for (int i = std::max(0, x-1); i <= std::min(X-1, x+1); ++i)
            for (int j = std::max(0, y-1); j <= std::min(Y-1, y+1); ++j)
                if (d[i][j] != NoData)
                    r[i-x+1][j-y+1] = d[i][j];
        // use r for whatever...
    }
Note that I'm using signed int very deliberately so x-1 and y-1 don't become huge positive numbers (as they would with, say, size_t) and break the std::max/std::min clamping logic... but you could express it differently if you had some reason to prefer size_t (e.g. x == 0 ? 0 : x - 1).

Where is the bottleneck in this code?

I have the following tight loop that makes up the serial bottleneck of my code. Ideally I would parallelize the function that calls this, but that is not possible.
//n is about 60
for (int k = 0; k < n; k++)
{
    double fone = z[k*n+i+1];
    double fzer = z[k*n+i];
    z[k*n+i+1] = s*fzer+c*fone;
    z[k*n+i]   = c*fzer-s*fone;
}
Are there any optimizations that can be made such as vectorization or some evil inline that can help this code?
I am looking into finding eigen solutions of tridiagonal matrices. http://www.cimat.mx/~posada/OptDoglegGraph/DocLogisticDogleg/projects/adjustedrecipes/tqli.cpp.html
Short answer: Change the memory layout of your matrix from row-major order to column-major order.
Long answer:
It seems you are accessing the (i)th and (i+1)th columns of a matrix stored in row-major order - probably a big matrix that doesn't fit into the CPU cache as a whole. Basically, on every loop iteration the CPU has to wait for RAM (on the order of a hundred cycles). After a few iterations, theoretically, the address prediction should kick in and the CPU should speculatively load the data items even before the loop accesses them. That should help with RAM latency. But that still leaves the problem that the code uses the memory bus inefficiently: CPU and memory never exchange single bytes, only cache-lines (64 bytes on current processors). Of every 64-byte cache-line loaded and stored, your code only touches 16 bytes (or a quarter).
Transposing the matrix and accessing it in its native major order would increase memory bus utilization four-fold. Since that is probably the bottleneck of your code, you can expect a speedup of about the same order.
Whether it is worth it, depends on the rest of your algorithm. Other parts may of course suffer because of the changed memory layout.
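As a sketch (zT here is a hypothetical transposed copy with zT[i*n + k] == z[k*n + i]; not part of the original answer), the same update then walks contiguous memory:
for (int k = 0; k < n; k++)
{
    double fone = zT[(i+1)*n + k];
    double fzer = zT[i*n + k];
    zT[(i+1)*n + k] = s*fzer + c*fone;
    zT[i*n + k]     = c*fzer - s*fone;
}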
I take it you are rotating something (or rather, lots of things, by the same angle (s being a sin, c being a cos))?
Counting backwards is always good fun and cuts out a variable comparison on each iteration, and it should work here. Making the counter the index might save a bit of time too (it cuts out a bit of arithmetic, as others have said).
for (int k = (n-1)*n + i; k >= 0; k -= n)
{
    double fone = z[k+1];
    double fzer = z[k];
    z[k+1] = s*fzer + c*fone;
    z[k]   = c*fzer - s*fone;
}
Nothing dramatic here, but it looks tidier if nothing else.
As a first move I'd cache the pointer in this loop:
//n is about 60
double *cur_z = &z[0*n+i];
for (int k = 0; k < n; k++)
{
    double fone = *(cur_z+1);
    double fzer = *cur_z;
    *(cur_z+1) = s*fzer+c*fone;
    *cur_z     = c*fzer-s*fone;
    cur_z += n;
}
Second, I think it's better to make a templatized version of this function. As a result, you can get a good performance benefit if your matrix holds integer values (since FPU operations are slower).
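A sketch of that templatized version (the function name and parameter list are mine; the original loop uses the free variables z, n, i, s, c):
template <typename T>
void rotate_columns(T* z, int n, int i, T s, T c) {
    T* cur_z = &z[i];
    for (int k = 0; k < n; k++) {
        T fone = *(cur_z+1);
        T fzer = *cur_z;
        *(cur_z+1) = s*fzer + c*fone;
        *cur_z     = c*fzer - s*fone;
        cur_z += n;
    }
}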