Vectorize a Symmetric Matrix - c++

I would like to write a function with the following signature
VectorXd vectorize (const MatrixXd&);
which returns the contents of a symmetric matrix in VectorXd form, without repeated elements. For example,
int n = 3; // n may be much larger in practice.
MatrixXd sym(n, n);
sym << 9, 2, 3,
2, 8, 4,
3, 4, 7;
std::cout << vectorize(sym) << std::endl;
should return:
9
2
3
8
4
7
The order of elements within vec is not important, provided it is systematic. What is important for my purposes is to return the data of sym without the repeated elements, because sym is always assumed to be symmetric. That is, I want to return the elements of the upper or lower triangular "view" of sym in VectorXd form.
I have naively implemented vectorize with nested for loops, but this function may be called very often within my program (over 1 million times). My question is thus: what is the most computationally efficient way to write vectorize? I was hoping to use Eigen's triangularView, but I do not see how.
Thank you in advance.

Regarding efficiency, you could write a single for loop with column-wise (and thus vectorized) copies:
VectorXd res(mat.rows()*(mat.cols()+1)/2);
Index size = mat.rows();
Index offset = 0;
for(Index j=0; j<mat.cols(); ++j) {
res.segment(offset,size) = mat.col(j).tail(size);
offset += size;
size--;
}
In practice, I expect that the compiler already fully vectorized your nested loop, and thus speed should be roughly the same.

Related

Performance optimization nested loops

I am implementing a rather complicated code and in one of the critical sections I need to basically consider all the possible strings of numbers following a certain rule. The naive implementation to explain what I do would be such a nested loop implementation:
std::array<int,3> max = { 3, 4, 6};
for(int i = 0; i <= max.at(0); ++i){
for(int j = 0; j <= max.at(1); ++j){
for(int k = 0; k <= max.at(2); ++k){
DoSomething(i, j, k);
}
}
}
Obviously I actually need more nested for and the "max" rule is more complicated but the idea is clear I think.
I implemented this idea using a recursive function approach:
std::array<int,3> max = { 3, 4, 6};
std::array<int,3> index = {0, 0, 0};
int total_depth = 3;
recursive_nested_for(0, index, max, total_depth);
where
void recursive_nested_for(int depth, std::array<int,3>& index,
std::array<int,3>& max, int total_depth)
{
if(depth != total_depth){
for(int i = 0; i <= max.at(depth); ++i){
index.at(depth) = i;
recursive_nested_for(depth+1, index, max, total_depth);
}
}
else
DoSomething(index);
}
In order to save as much as possible I declare all the variable I use global in the actual code.
Since this part of the code takes really long is it possible to do anything to speed it up?
I would also be open to write 24 nested for if necessary to avoid the overhead at least!
I thought that maybe an approach like expressions templates to actually generate at compile time these nested for could be more elegant. But is it possible?
Any suggestion would be greatly appreciated.
Thanks to all.
The recursive_nested_for() is a nice idea. It's a bit inflexible as it is currently written. However, you could use std::vector<int> for the array dimensions and indices, or make it a template to handle any size std::array<>. The compiler might be able to inline all recursive calls if it knows how deep the recursion is, and then it will probably be just as efficient as the three nested for-loops.
Another option is to use a single for loop for incrementing the indices that need incrementing:
void nested_for(std::array<int,3>& index, std::array<int,3>& max)
{
while (index.at(2) < max.at(2)) {
DoSomething(index);
// Increment indices
for (int i = 0; i < 3; ++i) {
if (++index.at(i) >= max.at(i))
index.at(i) = 0;
else
break;
}
}
}
However, you can also consider creating a linear sequence that visits all possible combinations of the iterators i, j, k and so on. For example, with array dimensions {3, 4, 6}, there are 3 * 4 * 6 = 72 possible combinations. So you can have a single counter going from 0 to 72, and then "split" that counter into the three iterator values you need, like so:
for (int c = 0; c < 72; c++) {
int k = c % 6;
int j = (c / 6) % 4;
int i = c / 6 / 4;
DoSomething(i, j, k);
}
You can generalize this to as many dimensions as you want. Of course, the more dimensions you have, the higher the cost of splitting the linear iterator. But if your array dimensions are powers of two, it might be very cheap to do so. Also, it might be that you don't need to split it at all; for example if you are calculating the sum of all elements of a multidimensional array, you don't care about the actual indices i, j, k and so on, you just want to visit all elements once. If the array is layed out linearly in memory, then you just need a linear iterator.
Of course, if you have 24 nested for loops, you'll notice that the product of all the dimension's sizes will become a very large number. If it doesn't fit in a 32 bit integer, your code is going to be very slow. If it doesn't fit into a 64 bit integer anymore, it will never finish.

How to increase value of a k consecutive elements in an vector in c++?

Suppose we have an vector in c++ of size 8 with elements {0, 1, 1, 0, 0, 0, 1, 1} and i want to increase the size of a specific portion of vector by one, for example, lets say the portion of vector which needs to be increase by 1 is 0 to 5, then our final result is {1, 2, 2, 1, 1, 0, 0, 1, 1}.
Is it possible to do this in constant time using standard method of vectors (like we a memset in c), without running any loop?
No... and by the way with memset you don't have a guaranteed constant-time operation either (in most implementation is just very fast but still linear in the number of elements).
If you need to do this kind of operation (addition/subtraction of a constant over a range) on a very huge vector a lot of times and you need to get the final result then you can get O(1) per update using a different algorithm:
Step 1: convert the data to its "derivative"
This mean replacing each element with the difference from previous one.
// O(n) on the size of the vector, but done only once
for (int n=v.size()-1; i>0; i--) {
v[i] -= v[i-1];
}
Step 2: do all the interval operations (each in constant time)
With this representation adding a constant to a range simply means adding it to the first element and subtracting it from the element past the ending one. In code:
// intervals contains structures with start/stop/value fields
// Operation is O(n) on the **number of intervals**, and does
// not depend on the size of them
for (auto r : intervals) {
v[r.start] += r.value;
v[r.stop+1] -= r.value;
}
Step 3: Collect the results
Finally you just need to un-do the initial processing, getting back to the normal values on each cell by integrating. In code:
// O(n) on the size of vector, but done only once
for (int i=1,n=v.size(); i<n; i++) {
v[i] += v[i-1];
}
Note that both step 1 and 3 (derivation and integration) can be done in parallel on N cores with perfect efficiency if the size is large enough, even if how this is possible may be not obvious at a first sight (it wasn't for me, at least).

C++ Optimizing this Algorithm

After watching some Terence Tao videos, I wanted to try implementing algorithms into c++ code to find all the prime numbers up to a number n. In my first version, where I simply had every integer from 2 to n tested to see if they were divisible by anything from 2 to sqrt(n), I got the program to find the primes between 1-10,000,000 in ~52 seconds.
Attempting to optimize the program, and implementing what I now know to be the Sieve of Eratosthenes, I assumed the task would be done much faster than 51 seconds, but sadly, that wasn't the case. Even going up to 1,000,000 took a considerable amount of time (didn't time it, though)
#include <iostream>
#include <vector>
using namespace std;
void main()
{
vector<int> tosieve = {};
for (int i = 2; i < 1000001; i++)
{
tosieve.push_back(i);
}
for (int j = 0; j < tosieve.size(); j++)
{
for (int k = j + 1; k < tosieve.size(); k++)
{
if (tosieve[k] % tosieve[j] == 0)
{
tosieve.erase(tosieve.begin() + k);
}
}
}
//for (int f = 0; f < tosieve.size(); f++)
//{
// cout << (tosieve[f]) << endl;
//}
cout << (tosieve.size()) << endl;
system("pause");
}
Is it the repeated referencing of the vectors or something? Why is this so slow? Even if I'm completely overlooking something (could be, complete beginner at this :I) I would think that finding the primes between 2 and 1,000,000 with this horrible inefficient method would be faster than my original way of finding them from 2 to 10,000,000.
Hope someone has a clear answer to this - hopefully I can use whatever knowledge is gleaned in the future when optimizing programs using a lot of recursion.
The problem is that 'erase' moves every element in the vector down one, meaning it is an O(n) operation.
There are three alternative choices:
1) Just mark deleted elements as 'empty' (make them 0, for example). This will mean future passes have to pass over those empty positions, but that isn't that expensive.
2) Make a new vector, and push_back new values into there.
3) Use std::remove_if: This will move the elements down, but do it in a single pass so will be more efficient. If you use std::remove_if, then you will have to remember it doesn't resize the vector itself.
Most of vector operations, including erase() have a O(n) linear time complexity.
Since you have two loops of size 10^6, and a vector of size 10^6, your algorithm executes up to 10^18 operations.
Qubic algorithms for such a big N will take a huge amount of time.
N = 10^6 is even big enough for quadratic algorithms.
Please, read carefully about Sieve of Eratosthenes. The fact that both full search and Sieve of Eratosthenes algorithms took the same time, means that you have done the second one wrong.
I see two performanse issues here:
First of all, push_back() will have to reallocate the dynamic memory block once in a while. Use reserve():
vector<int> tosieve = {};
tosieve.resreve(1000001);
for (int i = 2; i < 1000001; i++)
{
tosieve.push_back(i);
}
Second erase() has to move all Elements behind the one you try to remove. You set the elements to 0 instead and do a run over the vector in the end (untested code):
for (auto& x : tosieve) {
for (auto y = tosieve.begin(); *y < x; ++y) // this check works only in
// the case of an ordered vector
if (y != 0 && x % y == 0) x = 0;
}
{ // this block will make sure, that sieved will be released afterwards
auto sieved = vector<int>{};
for(auto x : tosieve)
sieved.push_back(x);
swap(tosieve, sieved);
} // the large memory block is released now, just keep the sieved elements.
consider to use standard algorithms instead of hand written loops. They help you to state your intent. In this case I see std::transform() for the outer loop of the sieve, std::any_of() for the inner loop, std::generate_n() for filling tosieve at the beginning and std::copy_if() for filling sieved (untested code):
vector<int> tosieve = {};
tosieve.resreve(1000001);
generate_n(back_inserter(tosieve), 1000001, []() -> int {
static int i = 2; return i++;
});
transform(begin(tosieve), end(tosieve), begin(tosieve), [](int i) -> int {
return any_of(begin(tosieve), begin(tosieve) + i - 2,
[&i](int j) -> bool {
return j != 0 && i % j == 0;
}) ? 0 : i;
});
swap(tosieve, [&tosieve]() -> vector<int> {
auto sieved = vector<int>{};
copy_if(begin(tosieve), end(tosieve), back_inserter(sieved),
[](int i) -> bool { return i != 0; });
return sieved;
});
EDIT:
Yet another way to get that done:
vector<int> tosieve = {};
tosieve.resreve(1000001);
generate_n(back_inserter(tosieve), 1000001, []() -> int {
static int i = 2; return i++;
});
swap(tosieve, [&tosieve]() -> vector<int> {
auto sieved = vector<int>{};
copy_if(begin(tosieve), end(tosieve), back_inserter(sieved),
[](int i) -> bool {
return !any_of(begin(tosieve), begin(tosieve) + i - 2,
[&i](int j) -> bool {
return i % j == 0;
});
});
return sieved;
});
Now instead of marking elements, we don't want to copy afterwards, but just directly copy only the elements, we want to copy. This is not only faster than the above suggestion, but also better states the intent.
Very interesting task you have. Thanks!
With pleasure I implemented from scratch my own versions of solving it.
I created 3 separate (independent) functions, all based on Sieve of Eratosthenes. These 3 versions are different in their complexity and speed.
Just a quick note, my simplest (slowest) version finds all primes below your desired limit of 10'000'000 within just 0.025 sec (i.e. 25 milli-seconds).
I also tested all 3 versions to find primes below 2^32 (4'294'967'296), which is solved by "simple" version within 47 seconds, by "intermediate" version within 30 seconds, by "advanced" within 12 seconds. So within just 12 seconds it finds all primes below 4 Billion (there are 203'280'221 such primes below 2^32, see OEIS sequence)!!!
For simplicity I will describe in details only Simple version out of 3. Here's code:
template <typename T>
std::vector<T> GenPrimes_SieveOfEratosthenes(size_t end) {
// https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes
if (end <= 2)
return {};
size_t const cnt = end >> 1;
std::vector<u8> composites((cnt + 7) / 8);
auto Get = [&](size_t i){ return bool((composites[i / 8] >> (i % 8)) & 1); };
auto Set = [&](size_t i){ composites[i / 8] |= u8(1) << (i % 8); };
std::vector<T> primes = {2};
size_t i = 0;
for (i = 1; i < cnt; ++i) {
if (Get(i))
continue;
size_t const p = 2 * i + 1, start = (p * p) >> 1;
primes.push_back(p);
if (start >= cnt)
break;
for (size_t j = start; j < cnt; j += p)
Set(j);
}
for (i = i + 1; i < cnt; ++i)
if (!Get(i))
primes.push_back(2 * i + 1);
return primes;
}
This code implements simplest but fast algorithm of finding primes, called Sieve of Eratosthenes. As a small optimization of speed and memory, I search only over odd numbers. This odd numbers optimization gives me ability to store 2x times less memory and do 2x times less steps, hence improves both speed and memory consumption exactly 2 times.
Algorithm is simple, we allocate array of bits, this array at position K has bit 1 if K is composite, or has 0 if K is probably prime. At the end all 0 bits in array signify Definite primes (that are for sure primes). Also due to odd numbers optimization this bit-array stores only odd numbers, so K-th bit is actually a number 2 * K + 1.
Then left to right we go over this array of bits and if we meet 0 bit at position K then it means we found a prime number P = 2 * K + 1 and now starting from position (P * P) / 2 we mark every P-th bit with 1. It means we mark all numbers bigger than P*P that are composite, because they are divisible by P.
We do this procedure only until P * P becomes greater or equal to our limit End (we're finding all primes < End). This limit guarantees that after reaching it ALL zero bits inside array signify prime numbers.
Second version of code does only one optimization to this Simple version, it makes all multi-core (multi-threaded). But this only optimization makes code much bigger and more complex. Basically it slices whole range of bits into all cores, so that they write bits to memory in parallel.
I'll explain only my third Advanced version, it is most complex of 3 versions. It does not only multi-threaded optimization, but also so-called Primorial optimization.
What is Primorial, it is a product of first smallest primes, for example I take primorial 2 * 3 * 5 * 7 = 210.
We can see that any primorial splits infinite range of integers into wheels by modulus of this primorial. For example primorial 210 splits into ranges [0; 210), [210; 2210), [2210; 3*210), etc.
Now it is easy to mathematically prove that inside All ranges of primorial we can mark same positions of numbers as complex, exactly we can mark all numbers that are multiple of 2 or 3 or 5 or 7 as composite.
We can see that out of 210 remainders there are 162 remainders that are for sure composite, and only 48 remainders are probably prime.
Hence it is enough for us to check primality of only 48/210=22.8% of whole search space. This reduction of search space makes task more than 4x times faster, and 4x times less memory consuming.
One can see that my first Simple version in fact due to odd-only optimization was actually using Primorial equal to 2 optimization. Yes, if we take primorial 2 instead of primorial 210, then we gain exactly first version (Simple) algorithm.
All of my 3 versions are tested for correctness and speed. Although still some tiny bugs can remain. Note. Yet it is recommended not to use my code straight away in production, unless it is tested thoroughly.
All 3 versions are tested for correctness by re-using each other answers. I thoroughly test correctness by feeding all limits (end value) from 0 to 2^18. It takes some time to do this.
See main() function to figure out how to use my functions.
Try it online!
SOURCE CODE GOES HERE. Due to StackOverflow limit of 30K symbols per post, I can't inline source code here, as it is almost 30K in size and together with English post above it takes more than 30K. So I'm providing source code on separate Github Gist server, link below. Note that Try it online! link above also contains full source code, but I reduced search limit of 2^32 to smaller one due to GodBolt limit of running time to 3 seconds.
Github Gist code
Output:
10M time 'Simple' 0.024 sec
Time 2^32 'Simple' 46.924 sec, number of primes 203280221
Time 2^32 'Intermediate' 30.999 sec
Time 2^32 'Advanced' 11.359 sec
All checked till 0
All checked till 5000
All checked till 10000
All checked till 15000
All checked till 20000
All checked till 25000

find a matrix in a big matrix

I have a very large n*m matrix S. I want to efficiently determine whether there exists a submatrix F inside of S. The large matrix S can have a size as big as 500*500.
To clarify, consider the following:
S = 1 2 3
4 5 6
7 8 9
F1 = 2 3
5 6
F2 = 1 2
4 6
In such a case:
F1 is inside S
F2 is not inside S
Each element in the matrix is a 32-bit integer. I can only think of using a brute-force approach to find whether F is a submatrix of S. I googled to find an effective algorithm, but I can't find anything.
Is there some algorithm or principle to do it faster? (Or possibly some method to optimize the brute force approach?)
PS the statistics data
A total of 8 S
On average, each S will be matched against about 44 F.
The probability of success match (i.e. F appears in a S) is
19%.
It involves preprocessing the matrix. This will be heavy on memory, but it should be better in terms of computation time.
Check if the size of the sub-matrix is less than that of the matrix before you do the check.
While constructing the matrix, build a construct that maps a value in the matrix to an array of (x,y) positions in the matrix. This will allow you to check for the existence of a sub-matrix where candidates could exist. You would use the value at (0,0) in the sub-matrix and get the possible positions of this value in the larger matrix. If the list of positions is empty, you have no candidates, and so, the sub-matrix does not exist. There's a start (More experienced people might consider this a naive approach however).
Modified Code of deepu-benson
int Ma[][5]= {
{0, 0, 1, 0, 0},
{0, 0, 1, 0, 0},
{0, 1, 0, 0, 0},
{0, 1, 0, 0, 0},
{1, 1, 1, 1, 0}
};
int Su[][3]= {
{1, 0, 0},
{1, 0, 0},
};
int S = 5;// Size of main matrix row
int T = 5;//Size of main matrix column
int M = 2; // size of desire matrix row
int N = 3; // Size of desire matrix column
int flag, i,j,p,q;
for(i=0; i<=(S-M); i++)
{
for(j=0; j<=(T-N); j++)
{
flag=0;
for(p=0; p<M; p++)
{
for(int q=0; q<N; q++)
{
if(Ma[i+p][j+q] != Su[p][q])
{
flag=1;
break;
}
}
}
if(flag==0)
{
printf("Match Found in the Main Matrix at starting location %d, %d",(i+1) ,(j+1));
break;
}
}
if(flag==0)
{
printf("Match Found in the Main Matrix at starting location %d, %d",(i+1) ,(j+1));
break;
}
}
If you want to query multiple times for a same big matrix and same size submatrices. There are many solutions to preprocess the big matrix.
A similar ( or even same ) problem is here.
Fastest way to Find a m x n submatrix in M X N matrix
Since you only want to know whether a given matrix is inside another big matrix. If you know how to use Matlab code from C++, you may directly use ismember from Matlab. Another way may be try to figure out how ismember works in Matlab, then implement the same thing in C++.
See Find location of submatrix
Since you have tagged the question as C++ also, I am providing this code. This is a brute force technique and definitely not the ideal solution for this problem. For an S X T Main Matrix and a M X N Sub Matrix, the time complexity of the algorithm is O(STMN).
cout<<"\nEnter the order of the Main Matrix";
cin>>S>>T;
cout<<"\nEnter the order of the Sub Matrix";
cin>>M>>N;
// Read the Main Matrix into MAT[S][T]
// Read the Sub Matrix into SUB[M][N]
for(i=0; i<(S-M); i++)
{
for(j=0; j<(T-N); j++)
{
flag=0;
for(p=0; p<M; p++)
{
for(q=0; q<N; q++)
{
if(MAT[i+p][j+q] != SUB[p][q])
{
flag=1;
break;
}
}
if(flag==0)
{
cout<<"Match Found in the Main Matrix at starting location "<<(i+1) <<"X"<<(j+1);
break;
}
}
if(flag==0)
{
break;
}
}
if(flag==0)
{
break;
}
}
Much of the answer depends on what you're doing repetitively. Are you testing a bunch of huge matrices for the same submatrix? Are you testing one huge matrix looking for a bunch of different submatrices?
Do any of the matrices have repetitive patterns, or are they nice and random, or can you make no assumptions about the data?
Also, does the submatrix have to be contiguous? Does S contain
F3 = 1 3
7 9
If the data in matrix isn't randomly distributed, it would be helpful to run some statistical analysis on it. Then you could find the sub matrix by comparing its element ranged by their inverse probability. It could be faster, then a plain bruteforce.
Say, you have the matrix of some normally distributed integers with the Gaussian center in 0. And you want to find submatrix say:
1 3 -12
-3 43 -1
198 2 2
You have to start searching for 198, then checking upper right element to be 43 then its upper right for -12, then any 3 or -3 will do; and so on. This would greatly reduce the number of comparisons comparing to the most brutal solution.
My original answer is below the break, thinking about it there are several optimisations, these optimisations refer to the steps of the original answer.
For Step B) do not search the entirety of S: you can discount all columns and rows which would not allow F to fit. (in the below example, only search the upper left 2x2 matrix). In cases where F is a significant proportion of S this would save considerable time.
If the range of values within S is quite low then creating a lookup table would greatly reduce the time required for step B).
Working with these 2 matrices
find inside
A) Select one value from the smaller matrix:
B) locate it within the larger
C) Check the adjacent cells to see if they match
-
It's possible to do in O(N*M*(logN+logM)).
Equality can be expressed as sum of squared differences is 0:
sum[i,j](square(S(n+i,m+j)-F(i,j)))=0
sum[i,j]square(S(n+i,m+j))+sum[i,j](square(F(i,j))-2*sum[i,j](S(n+i,m+j)*F(i,j))=0
First part can be calculated for all (n,m) in O(N*M) similarly to running average.
Second part is calculated as usual in O(sizeof(F)) which is less than O(N*M).
Third part is the most interesting. It's convolution which can be calculated in O(N*M*(logN+logM)) using Fast Fourier Transform: http://en.wikipedia.org/wiki/Convolution#Fast_convolution_algorithms

Efficient layout and reduction of virtual 2d data (abstract)

I use C++ and CUDA/C and want to write code for a specific problem and I ran into a quite tricky reduction problem.
My experience in parallel programming isn't negligible but quite limited and I cannot totally forsee the specificity of this problem.
I doubt there is a convenient or even "easy" way to handle the problems I am facing but perhaps I am wrong.
If there are any resources (i.e. articles, books, web-links, ...) or key-words covering this or similar problems, please let me know.
I tried to generalize the whole case as good as possible and keep it abstract instead of posting too much code.
The Layout ...
I have a system of N inital elements and N result elements. (I'll use N=8 for example but N can be any integral value greater than three.)
static size_t const N = 8;
double init_values[N], result[N];
I need to calculate almost every (not all i'm afraid) unique permutation of the init-values without self-interference.
This means calculation f(init_values[0],init_values[1]), f(init_values[0],init_values[2]), ..., f(init_values[0],init_values[N-1]), f(init_values[1],init_values[2]), ..., f(init_values[1],init_values[N-1]), ... and so on.
This is in fact a virtual triangular matrix which has the shape seen in the following illustration.
P 0 1 2 3 4 5 6 7
|---------------------------------------
0| x
|
1| 0 x
|
2| 1 2 x
|
3| 3 4 5 x
|
4| 6 7 8 9 x
|
5| 10 11 12 13 14 x
|
6| 15 16 17 18 19 20 x
|
7| 21 22 23 24 25 26 27 x
Each element is a function of the respective column and row elements in init_values.
P[i] (= P[row(i)][col(i]) = f(init_values[col(i)], init_values[row(i)])
i.e.
P[11] (= P[5][1]) = f(init_values[1], init_values[5])
There are (N*N-N)/2 = 28 possible, unique combinations (Note: P[1][5]==P[5][1], so we only have a lower (or upper) triangular matrix) using the example N = 8.
The basic problem
The result array is computed from P as a sum of the row elements minus the sum of the respective column elements.
For example the result at position 3 will be calculated as a sum of row 3 minus the sum of column three.
result[3] = (P[3]+P[4]+P[5]) - (P[9]+P[13]+P[18]+P[24])
result[3] = sum_elements_row(3) - sum_elements_column(3)
I tried to illustrate it in a picture with N = 4.
As a consequence the following is true:
N-1 operations (potential concurrent writes) will be performed on each result[i]
result[i] will have N-(i+1) writes from subtractions and i additions
Outgoing from each P[i][j] there will be a subtraction to r[j] and a addition to r[i]
This is where the main problems come into place:
Using one thread to compute each P and updating the result directly will result in multiple kernels trying to write to the same result location (N-1 threads each).
Storing the whole matrix P for a subsequent reduction step on the other hand is very expensive in terms of memory consumption and therefore impossible for very large systems.
The idea of having a unqiue, shared result vector for each thread-block is impossible, too.
(N of 50k makes 2.5 billion P elements and therefore [assuming a maximum number of 1024 threads per block] a minimal number of 2.4 million blocks consuming over 900GiB of memory if each block has its own result array with 50k double elements.)
I think I could handle reduction for a more static behaviour but this problem is rather dynamic in terms of potential concurrent memory write-access.
(Or is it possible to handle it by some "basic" type of reduction?)
Adding some complications ...
Unfortunatelly, depending on (arbitrary user) input, which is independant of the initial values, some elements of P need to be skipped.
Let's assume we need to skip permutations P[6], P[14] and P[18]. Therefore we have 24 combinations left, which need to be calculated.
How to tell the kernel which values need to be skipped?
I came up with three approaches, each having notable downsides if N is very large (like several ten thousands of elements).
1. Store all combinations ...
... with their respective row and column index struct combo { size_t row,col; };, that need to be calculated in a vector<combo> and operate on this vector. (used by the current implementation)
std::vector<combo> elements;
// somehow fill
size_t const M = elements.size();
for (size_t i=0; i<M; ++i)
{
// do the necessary computations using elements[i].row and elements[i].col
}
This solution consumes is consuming lots of memory since only "several" (may even be ten thousands of elements but that's not much in contrast to several billion in total) but it avoids
indexation computations
finding of removed elements
for each element of P which is the downside of the second approach.
2. Operate on all elements of P and find removed elements
If I want to operate on each element of P and avoid nested loops (which i couldn't reproduce very well in cuda) I need to do something like this:
size_t M = (N*N-N)/2;
for (size_t i=0; i<M; ++i)
{
// calculate row indices from `i`
double tmp = sqrt(8.0*double(i+1))/2.0 + 0.5;
double row_d = floor(tmp);
size_t current_row = size_t(row_d);
size_t current_col = size_t(floor(row_d*(ict-row_d)-0.5));
// check whether the current combo of row and col is not to be removed
if (!removes[current_row].exists(current_col))
{
// do the necessary computations using current_row and current_col
}
}
The vector removes is very small in contrast to the elements vector in the first example but the additional computations to obtain current_row, current_col and the if-branch are very inefficient.
(Remember we're still talking about billions of evaluations.)
3. Operate on all elements of P and remove elements afterwards
Another idea I had was to calculate all valid and invalid combinations independently.
But unfortunately, due to summation errors the following statement is true:
calc_non_skipped() != calc_all() - calc_skipped()
Is there a convenient, known, high performance way to get the desired results from the initial values?
I know that this question is rather complicated and perhaps limited in relevance. Nevertheless, I hope some illuminative answers will help me to solve my problems.
The current implementation
Currently this is implemented as CPU Code with OpenMP.
I first set up a vector of the above mentioned combos storing every P that needs to be computed and pass it to a parallel for loop.
Each thread is provided with a private result vector and a critical section at the end of the parallel region is used for a proper summation.
First, I was puzzled for a moment why (N**2 - N)/2 yielded 27 for N=7 ... but for indices 0-7, N=8, and there are 28 elements in P. Shouldn't try to answer questions like this so late in the day. :-)
But on to a potential solution: Do you need to keep the array P for any other purpose? If not, I think you can get the result you want with just two intermediate arrays, each of length N: one for the sum of the rows and one for the sum of the columns.
Here's a quick-and-dirty example of what I think you're trying to do (subroutine direct_approach()) and how to achieve the same result using the intermediate arrays (subroutine refined_approach()):
#include <cstdlib>
#include <cstdio>
const int N = 7;
const float input_values[N] = { 3.0F, 5.0F, 7.0F, 11.0F, 13.0F, 17.0F, 23.0F };
float P[N][N]; // Yes, I'm wasting half the array. This way I don't have to fuss with mapping the indices.
float result1[N] = { 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F };
float result2[N] = { 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F };
float f(float arg1, float arg2)
{
// Arbitrary computation
return (arg1 * arg2);
}
float compute_result(int index)
{
float row_sum = 0.0F;
float col_sum = 0.0F;
int row;
int col;
// Compute the row sum
for (col = (index + 1); col < N; col++)
{
row_sum += P[index][col];
}
// Compute the column sum
for (row = 0; row < index; row++)
{
col_sum += P[row][index];
}
return (row_sum - col_sum);
}
void direct_approach()
{
int row;
int col;
for (row = 0; row < N; row++)
{
for (col = (row + 1); col < N; col++)
{
P[row][col] = f(input_values[row], input_values[col]);
}
}
int index;
for (index = 0; index < N; index++)
{
result1[index] = compute_result(index);
}
}
void refined_approach()
{
float row_sums[N];
float col_sums[N];
int index;
// Initialize intermediate arrays
for (index = 0; index < N; index++)
{
row_sums[index] = 0.0F;
col_sums[index] = 0.0F;
}
// Compute the row and column sums
// This can be parallelized by computing row and column sums
// independently, instead of in nested loops.
int row;
int col;
for (row = 0; row < N; row++)
{
for (col = (row + 1); col < N; col++)
{
float computed = f(input_values[row], input_values[col]);
row_sums[row] += computed;
col_sums[col] += computed;
}
}
// Compute the result
for (index = 0; index < N; index++)
{
result2[index] = row_sums[index] - col_sums[index];
}
}
void print_result(int n, float * result)
{
int index;
for (index = 0; index < n; index++)
{
printf(" [%d]=%f\n", index, result[index]);
}
}
int main(int argc, char * * argv)
{
printf("Data reduction test\n");
direct_approach();
printf("Result 1:\n");
print_result(N, result1);
refined_approach();
printf("Result 2:\n");
print_result(N, result2);
return (0);
}
Parallelizing the computation is not so easy, since each intermediate value is a function of most of the inputs. You can compute the sums individually, but that would mean performing f(...) multiple times. The best suggestion I can think of for very large values of N is to use more intermediate arrays, computing subsets of the results, then summing the partial arrays to yield the final sums. I'd have to think about that one when I'm not so tired.
To cope with the skip issue: If it's a simple matter of "don't use input values x, y, and z", you can store x, y, and z in a do_not_use array and check for those values when looping to compute the sums. If the values to be skipped are some function of row and column, you can store those as pairs and check for the pairs.
Hope this gives you ideas for your solution!
Update, now that I'm awake: Dealing with "skip" depends a lot on what data needs to be skipped. Another possibility for the first case - "don't use input values x, y, and z" - a much faster solution for large data sets would be to add a level of indirection: create yet another array, this one of integer indices, and store only the indices of the good inputs. F'r instance, if invalid data is in inputs 2 and 5, the valid array would be:
int valid_indices[] = { 0, 1, 3, 4, 6 };
Interate over the array valid_indices, and use those indices to retrieve the data from your input array to compute the result. On the other paw, if the values to skip depend on both indices of the P array, I don't see how you can avoid some kind of lookup.
Back to parallelizing - No matter what, you'll be dealing with (N**2 - N)/2 computations
of f(). One possibility is to just accept that there will be contention for the sum
arrays, which would not be a big issue if computing f() takes substantially longer than
the two additions. When you get to very large numbers of parallel paths, contention will
again be an issue, but there should be a "sweet spot" balancing the number of parallel
paths against the time required to compute f().
If contention is still an issue, you can partition the problem several ways. One way is
to compute a row or column at a time: for a row at a time, each column sum can be
computed independently and a running total can be kept for each row sum.
Another approach would be to divide the data space and, thus, the computation into
subsets, where each subset has its own row and column sum arrays. After each block
is computed, the independent arrays can then be summed to produce the values you need
to compute the result.
This probably will be one of those naive and useless answers, but it also might help. Feel free to tell me that I'm utterly and completely wrong and I have misunderstood the whole affair.
So... here we go!
The Basic Problem
It seems to me that you can define you result function a little differently and it will lift at least some contention off your intermediate values. Let's suppose that your P matrix is lower-triangular. If you (virtually) fill the upper triangle with the negative of the lower values (and the main diagonal with all zeros,) then you can redefine each element of your result as the sum of a single row: (shown here for N=4, and where -i means the negative of the value in the cell marked as i)
P 0 1 2 3
|--------------------
0| x -0 -1 -3
|
1| 0 x -2 -4
|
2| 1 2 x -5
|
3| 3 4 5 x
If you launch independent threads (executing the same kernel) to calculate the sum of each row of this matrix, each thread will write a single result element. It seems that your problem size is large enough to saturate your hardware threads and keep them busy.
The caveat, of course, is that you'll be calculating each f(x, y) twice. I don't know how expensive that is, or how much the memory contention was costing you before, so I cannot judge whether this is a worthwhile trade-off to do or not. But unless f was really really expensive, I think it might be.
Skipping Values
You mention that you might have tens of thousands elements of the P matrix that you need to ignore in your calculations (effectively skip them.)
To work with the scheme I've proposed above, I believe you should store the skipped elements as (row, col) pairs, and you have to add the transposed of each coordinate pair too (so you'll have twice the number of skipped values.) So your example skip list of P[6], P[14] and P[18] becomes P(4,0), P(5,4), P(6,3) which then becomes P(4,0), P(5,4), P(6,3), P(0,4), P(4,5), P(3,6).
Then you sort this list, first based on row and then column. This makes our list to be P(0,4), P(3,6), P(4,0), P(4,5), P(5,4), P(6,3).
If each row of your virtual P matrix is processed by one thread (or a single instance of your kernel or whatever,) you can pass it the values it needs to skip. Personally, I would store all these in a big 1D array and just pass in the first and last index that each thread would need to look at (I would also not store the row indices in the final array that I passed in, since it can be implicitly inferred, but I think that's obvious.) In the example above, for N = 8, the begin and end pairs passed to each thread will be: (note that the end is one past the final value needed to be processed, just like STL, so an empty list is denoted by begin == end)
Thread 0: 0..1
Thread 1: 1..1 (or 0..0 or whatever)
Thread 2: 1..1
Thread 3: 1..2
Thread 4: 2..4
Thread 5: 4..5
Thread 6: 5..6
Thread 7: 6..6
Now, each thread goes on to calculate and sum all the intermediate values in a row. While it is stepping through the indices of columns, it is also stepping through this list of skipped values and skipping any column number that comes up in the list. This is obviously an efficient and simple operation (since the list is sorted by column too. It's like merging.)
Pseudo-Implementation
I don't know CUDA, but I have some experience working with OpenCL, and I imagine the interfaces are similar (since the hardware they are targeting are the same.) Here's an implementation of the kernel that does the processing for a row (i.e. calculates one entry of result) in pseudo-C++:
double calc_one_result (
unsigned my_id, unsigned N, double const init_values [],
unsigned skip_indices [], unsigned skip_begin, unsigned skip_end
)
{
double res = 0;
for (unsigned col = 0; col < my_id; ++col)
// "f" seems to take init_values[column] as its first arg
res += f (init_values[col], init_values[my_id]);
for (unsigned row = my_id + 1; row < N; ++row)
res -= f (init_values[my_id], init_values[row]);
// At this point, "res" is holding "result[my_id]",
// including the values that should have been skipped
unsigned i = skip_begin;
// The second condition is to check whether we have reached the
// middle of the virtual matrix or not
for (; i < skip_end && skip_indices[i] < my_id; ++i)
{
unsigned col = skip_indices[i];
res -= f (init_values[col], init_values[my_id]);
}
for (; i < skip_end; ++i)
{
unsigned row = skip_indices[i];
res += f (init_values[my_id], init_values[row]);
}
return res;
}
Note the following:
The semantics of init_values and function f are as described by the question.
This function calculates one entry in the result array; specifically, it calculates result[my_id], so you should launch N instances of this.
The only shared variable it writes to is result[my_id]. Well, the above function doesn't write to anything, but if you translate it to CUDA, I imagine you'd have to write to that at the end. However, no one else writes to that particular element of result, so this write will not cause any contention of data race.
The two input arrays, init_values and skipped_indices are shared among all the running instances of this function.
All accesses to data are linear and sequential, except for the skipped values, which I believe is unavoidable.
skipped_indices contain a list of indices that should be skipped in each row. It's contents and structure are as described above, with one small optimization. Since there was no need, I have removed the row numbers and left only the columns. The row number will be passed into the function as my_id anyways and the slice of the skipped_indices array that should be used by each invocation is determined using skip_begin and skip_end.
For the example above, the array that is passed into all invocations of calc_one_result will look like this:[4, 6, 0, 5, 4, 3].
As you can see, apart from the loops, the only conditional branch in this code is skip_indices[i] < my_id in the third for-loop. Although I believe this is innocuous and totally predictable, even this branch can be easily avoided in the code. We just need to pass in another parameter called skip_middle that tells us where the skipped items cross the main diagonal (i.e. for row #my_id, the index at skipped_indices[skip_middle] is the first that is larger than my_id.)
In Conclusion
I'm by no means an expert in CUDA and HPC. But if I have understood your problem correctly, I think this method might eliminate any and all contentions for memory. Also, I don't think this will cause any (more) numerical stability issues.
The cost of implementing this is:
Calling f twice as many times in total (and keeping track of when it is called for row < col so you can multiply the result by -1.)
Storing twice as many items in the list of skipped values. Since the size of this list is in the thousands (and not billions!) it shouldn't be much of a problem.
Sorting the list of skipped values; which again due to its size, should be no problem.
(UPDATE: Added the Pseudo-Implementation section.)