Splitting a massive matrix into blocks - c++

I have a problem. I'm working on a task that tries to find a matrix (vector) inside another matrix (vector). The sizes of the matrices are:
Massive Matrix: 1024x768
Small Matrix: 36x49
Basically, my idea was to split the massive matrix into blocks the size of the small matrix, so that I could check which block contains the small matrix and then output that block. However, the massive matrix does not split evenly into blocks of that size, and I still need a way to determine whether the small matrix actually exists in the massive matrix.
As an example, I'll use test data:
M1 =
0 1 0 0
1 1 1 1
0 0 0 0
1 0 1 1
M2 =
0 1
1 1
And then I would split the matrices into blocks of 2x2 and then check that way. This is simple as I'm only working with a small matrix AND the matrix can be split equally, whereas the problem above is a lot more complex to understand and figure out.
In essence, I need to be able to split the (1024x768) into block sizes of (36x49) so then I can do the check to determine where that particular matrix is. I have been working with this algorithm:
// Assume:
// matrix1ColSize = 768
// matrix2ColSize = 49
const int ROW_BOUNDS = matrix1.size() - matrix2.size();
const int COL_BOUNDS = matrix1ColSize - matrix2ColSize;
bool found = false;
for(int i=0; (i < ROW_BOUNDS); i++)
{
bool matchFound = false;
for(int j=0; (j < COL_BOUNDS); j++) {
// logic here
}
cout << endl;
}
Could anyone offer any advice please? This is really annoying me now :(!

Two matrices are the same if all their elements are the same. So the following pseudo-code compares the small matrix with a block in the large matrix:
Initialize result to "true"
For each position in the small matrix
Read the value from the large matrix; call it x1
Read the value from the small matrix; call it x2
If x1 is not equal to x2, set result to "false"
(Optional) If x1 is not equal to x2, stop looking at other positions
Here, use the result
This logic is going to be inside your 2 nested loops, so you will have 4 nested loops there! If you are afraid of getting confused, put the implementation inside a function. If you want to use 4 nested loops, good luck.
In c++:
bool is_equal = true;
for (int y = 0; y < 36; ++y) // rows of the small matrix
{
for (int x = 0; x < 49; ++x) // columns of the small matrix (matrix2ColSize)
{
if (matrix1.at(j + x, i + y) != matrix2.at(x, y))
{
is_equal = false;
goto DONE; // optional
}
}
}
DONE:;
Edit: this code assumes a custom class for matrices; after looking at your code again, I realize that you are probably using a vector of vectors (std::vector<std::vector<int>>), so use matrix1[i + y][j + x] and matrix2[y][x] instead of matrix1.at(j + x, i + y) and matrix2.at(x, y).
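Putting the pieces together, here is a minimal sketch of the whole search, assuming both matrices are stored as std::vector<std::vector<int>> with the outer index being the row. The names Matrix, blockMatches and findSubmatrix are made up for this illustration and do not come from the question.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// True if `small` equals the block of `big` whose top-left corner is (row, col).
// The caller must make sure the block fits inside `big`.
bool blockMatches(const Matrix& big, const Matrix& small,
                  std::size_t row, std::size_t col)
{
    for (std::size_t y = 0; y < small.size(); ++y)
        for (std::size_t x = 0; x < small[y].size(); ++x)
            if (big[row + y][col + x] != small[y][x])
                return false;   // early exit replaces the goto
    return true;
}

// Returns (row, col) of the first match, or (-1, -1) if there is none.
std::pair<int, int> findSubmatrix(const Matrix& big, const Matrix& small)
{
    if (small.empty() || small[0].empty() || big.size() < small.size() ||
        big[0].size() < small[0].size())
        return {-1, -1};
    // Note the "+ 1": the last valid start position is included.
    const std::size_t rowStarts = big.size() - small.size() + 1;
    const std::size_t colStarts = big[0].size() - small[0].size() + 1;
    for (std::size_t i = 0; i < rowStarts; ++i)
        for (std::size_t j = 0; j < colStarts; ++j)
            if (blockMatches(big, small, i, j))
                return {static_cast<int>(i), static_cast<int>(j)};
    return {-1, -1};
}
```

Because every start position is tried, it does not matter that 1024x768 is not a multiple of 36x49. For the M1 and M2 test data from the question, findSubmatrix(M1, M2) reports the top-left corner (0, 0).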

Related

Most efficient way to test a 256-bit YMM AVX register element for equal or less than zero

I'm implementing a particle system using Intel AVX intrinsics. When the Y-position of a particle is less than or equal to zero I want to reset the particle.
The particle system is ordered in a SOA-pattern like this:
class ParticleSystem
{
private:
float* mXPosition;
float* mYPosition;
float* mZPosition;
.... Rest of code not important for this question
My initial approach was simply to iterate through the mYPosition array and check for the case stated at the beginning. Perhaps some performance improvements could be made with this approach?
The question, however, is whether there is an efficient way to implement this
using the AVX intrinsics. Thank you!
If the elements which are <= 0 are relatively sparse then one simple approach is to test 8 at a time using AVX and then drop into scalar code when you identify a vector which contains one or more such elements, e.g.:
#include <immintrin.h> // AVX intrinsics
const __m256 vk0 = _mm256_setzero_ps(); // const vector of zeros
int i = 0; // declared outside the loop so the tail loop below can start from it
for (; i + 8 <= n; i += 8)
{
__m256 vy = _mm256_loadu_ps(&mYPosition[i]); // load 8 x floats
__m256 vcmp = _mm256_cmp_ps(vy, vk0, _CMP_LE_OS); // compare for <= 0
int mask = _mm256_movemask_ps(vcmp); // get MS bits from comparison result
if (mask != 0) // if any bits set
{ // then we have 1 or more elements <= 0
for (int k = 0; k < 8; ++k) // test each element in vector
{ // using scalar code...
if ((mask & 1) != 0)
{
// found element at index i + k
// do something with it...
}
mask >>= 1;
}
}
}
// deal with any remaining elements in case where n is not a multiple of 8
for (int j = i; j < n; ++j)
{
if (mYPosition[j] <= 0.0f)
{
// found element at index j
// do something with it...
}
}
Of course if the matching elements are not sparse, i.e. if you are typically finding one or more in every vector of 8, then this isn't going to buy you any performance gain. However if the elements are sparse, such that most vectors can be skipped, then you should see a significant benefit.

Vector particles with nested for loops - collisions not detected

So I have some particles (ellipses) bouncing around the screen. I'm trying to get them to collide rather than pass over each other. In order to do this I must cycle through every particle and compare its distance to every other particle with a for loop nested within another for loop, then tell their velocities to change when their points are within a certain distance of each other, like so:
//p.size() returns the size of the particle system (yes it works)
//ofDist() is an open frameworks function that calculates the dist between 2 points
for( int i = 0; i < p.size(); i++){
// cout << i << endl;
for(int j = 0; j < p.size(); j++){
// cout << j << endl;
pDist[i] = ofDist(p[i].pos.x, p[i].pos.y, p[j].pos.x, p[j].pos.y);
// cout << pDist[i] << endl;
if(pDist[i] <= 300){
p[i].vel.x *= -1;
p[i].vel.y *= -1;
p[j].vel.x *= -1;
p[j].vel.y *= -1;
}
}
}
But for some mysterious reason they still pass right over each other like they don't even exist. It does work if I apply this to just 2 particles without the for loops:
pDist[0] = ofDist(p[0].pos.x, p[0].pos.y, p[1].pos.x, p[1].pos.y);
if(pDist[0] <= 300){
cout << "It's colliding" << endl;
p[0].vel.x *= -1;
p[0].vel.y *= -1;
p[1].vel.x *= -1;
p[1].vel.y *= -1;
}
The particles are stored in a vector by the way.
Any ideas how I can get this to work with the for loops?
update
The size of my vector is 3, so p.size() = 3 ( or 2, doesn't really make a difference right now). I substituted p.size() for 2 and 3 in my code and it didn't change anything, so that's not the source of the issue.
update 2
If someone could let me know what I need to do to not get downvoted that would be helpful. :/
A pretty large issue is that by saying:
for( int i = 0; i < p.size(); i++){
for(int j = 0; j < p.size(); j++){
You are actually checking each particle against itself. You are also checking each pair of particles twice. By detecting a single collision twice and inverting the velocity each time, you are essentially doing nothing (a * -1 * -1 = a).
A better way to do this would be to use a loop in which each pair of particles is checked only once and a particle is never checked against itself. You can do this by starting the nested loop after the current particle (essentially offsetting the index by the indexes that have already been checked), like so:
for( int i = 0; i + 1 < p.size(); i++){ // i + 1 avoids unsigned wrap-around when p is empty
for(int j = i+1; j < p.size(); j++){
This also has the benefit of being significantly faster for a larger number of particles.
There is also no reason to store the calculated distance in an array (unless your code makes use of this somewhere else). Simply using a double would work fine here.
Edit:
Just to be a bit clearer, I have logged the output of the two loops to demonstrate. I have used 3 particles in the array.
Original loop
1 compared to 1 (This is a problem. Checking a particle against itself)
1 compared to 2
1 compared to 3
2 compared to 1 (This is a problem. This has already been checked for)
2 compared to 2 (This is a problem. Checking a particle against itself)
2 compared to 3
3 compared to 1 (This is a problem. This has already been checked for)
3 compared to 2 (This is a problem. This has already been checked for)
3 compared to 3 (This is a problem. Checking a particle against itself)
Modified loop
1 compared to 2
1 compared to 3
2 compared to 3
As you can see, only three collisions are checked for in the modified loop, and there are no duplicates.
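The same enumeration can be reproduced with a small helper. This is only a sketch: uniquePairs is a hypothetical name, the indices are zero-based (so the pairs are 0-1, 0-2, 1-2 rather than 1-2, 1-3, 2-3), and the distance test and velocity flip from the question would go where the pair is recorded.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Lists every unordered pair (i, j) with i < j exactly once.
std::vector<std::pair<std::size_t, std::size_t>> uniquePairs(std::size_t count)
{
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    // "i + 1 < count" instead of "i < count - 1" stays correct when count is 0.
    for (std::size_t i = 0; i + 1 < count; ++i)
        for (std::size_t j = i + 1; j < count; ++j)
            pairs.emplace_back(i, j);   // collision check for (i, j) goes here
    return pairs;
}
```

For n particles this visits n*(n-1)/2 pairs instead of n*n, which is where the speed-up for larger systems comes from.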

Preparing a Matrix in C++ for Matlab

I have a sparse matrix P of dimension dim*dim given as a pointer through
double* P
/* create the output matrix */
plhs[0] = mxCreateDoubleMatrix(dim,dim,mxREAL);
/* get a pointer to the real data in the output matrix*/
P = mxGetPr(plhs[0]);
I do this in a mex file since I need a lot of for-loops to fill P and c++ is much faster than matlab for that.
For the moment, dim=22500 and it takes about 2 seconds for c++ to fill P (with matlab this task took 50 seconds), about 100 seconds to normalize the matrix in matlab, and another 100 seconds to erase all zero columns in matlab. I do this with the following code in matlab:
for i=1:size(P,1)
if sum(P(i,:)) > 0
sum(P(i,:))
P(i,:)=(1/sum(P(i,:))).*P(i,:);
end
end
% clear empty rows and columns
P(~any(P,2),:)=[];
P(:,~any(P))=[];
My question is now: Can I do this in c++ as well? I tried to normalize P in c++ in the following way:
int i;
int j;
int sum;
int get_idx(int x, int y, int rows) {
return x +y * rows;
}
/* NORMALIZE */
for(i = 0; i <dim; i++) {
sum=0;
for(j=0; j<dim;j++) {
sum = sum + P[get_idx(i,j,dim)];
}
if(sum > 0) {
for(j=0; j<dim;j++) {
P[get_idx(i,j,p_rows)]=P[get_idx(i,j,dim)]*(1/sum);
}
}
}
But for some reason this code does not seem to change P, and also this takes about 85 seconds in c++. Is there a faster way that also works? Also, is it possible to clear empty rows and columns?
Why C++?
Clear the empty rows/columns before normalization - you don't need to normalize empty entries.
Vectorize the normalization:
s = sum(P, 2);
valid = s > 0;
P( valid,: ) = bsxfun(@rdivide, P(valid,:), s(valid) );
Ta-da!
bsxfun is so much fun!
Update: Regarding the reduction of rows/columns.
After a short investigation I think there is a ~x3 speed factor to gain:
Consider these three options:
P( ~any(P,2), :) = []; P( :, ~any(P,1) ) = [];
P( :, ~any(P,1) ) = []; P( ~any(P,2), :) = [];
P = P( any(P,2), any(P,1) );
Test these three alternatives and you'll see that the third one is ~x3 faster, while the first is slightly (but consistently) slower than the second.
Why?
If you recall, Matlab stores matrices in memory in a column-first fashion; therefore, eliminating columns before rows saves some copying and re-allocation of memory.
Yet, the first and second alternatives copy and reallocate memory twice: once for rows and once for columns, while the third alternative messes with the memory only once!
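On the C++ side of the question, two things stand out in the posted loop: sum is declared as int, so 1/sum is an integer division that evaluates to 0 for any sum greater than 1, and p_rows is never defined. The inner loops also jump dim elements at a time through a column-major array. The following is a minimal sketch of a row normalization that avoids these points; normalizeRows is a hypothetical name, and whether this accounts for all of the 85 seconds is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Scales every row of a column-major dim x dim matrix so that it sums to 1.
// Rows whose sum is not positive are left unchanged.
void normalizeRows(double* P, std::size_t dim)
{
    std::vector<double> rowSum(dim, 0.0);   // double, not int
    // Walk the memory column by column, the order in which MATLAB stores it.
    for (std::size_t col = 0; col < dim; ++col)
        for (std::size_t row = 0; row < dim; ++row)
            rowSum[row] += P[row + col * dim];
    for (std::size_t col = 0; col < dim; ++col)
        for (std::size_t row = 0; row < dim; ++row)
            if (rowSum[row] > 0.0)
                P[row + col * dim] /= rowSum[row];
}
```

In a mex file this could be called on the pointer returned by mxGetPr, as in normalizeRows(P, dim).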

Efficient layout and reduction of virtual 2d data (abstract)

I use C++ and CUDA/C and want to write code for a specific problem and I ran into a quite tricky reduction problem.
My experience in parallel programming isn't negligible but quite limited and I cannot totally foresee the specificity of this problem.
I doubt there is a convenient or even "easy" way to handle the problems I am facing but perhaps I am wrong.
If there are any resources (i.e. articles, books, web-links, ...) or key-words covering this or similar problems, please let me know.
I tried to generalize the whole case as well as possible and keep it abstract instead of posting too much code.
The Layout ...
I have a system of N initial elements and N result elements. (I'll use N=8 as an example, but N can be any integral value greater than three.)
static size_t const N = 8;
double init_values[N], result[N];
I need to calculate almost every (not all, I'm afraid) unique pairwise combination of the init-values without self-interference (no element is paired with itself).
This means calculating f(init_values[0],init_values[1]), f(init_values[0],init_values[2]), ..., f(init_values[0],init_values[N-1]), f(init_values[1],init_values[2]), ..., f(init_values[1],init_values[N-1]), ... and so on.
This is in fact a virtual triangular matrix which has the shape seen in the following illustration.
P 0 1 2 3 4 5 6 7
|---------------------------------------
0| x
|
1| 0 x
|
2| 1 2 x
|
3| 3 4 5 x
|
4| 6 7 8 9 x
|
5| 10 11 12 13 14 x
|
6| 15 16 17 18 19 20 x
|
7| 21 22 23 24 25 26 27 x
Each element is a function of the respective column and row elements in init_values.
P[i] (= P[row(i)][col(i)]) = f(init_values[col(i)], init_values[row(i)])
i.e.
P[11] (= P[5][1]) = f(init_values[1], init_values[5])
There are (N*N-N)/2 = 28 possible, unique combinations (Note: P[1][5]==P[5][1], so we only have a lower (or upper) triangular matrix) using the example N = 8.
The basic problem
The result array is computed from P as a sum of the row elements minus the sum of the respective column elements.
For example the result at position 3 will be calculated as a sum of row 3 minus the sum of column three.
result[3] = (P[3]+P[4]+P[5]) - (P[9]+P[13]+P[18]+P[24])
result[3] = sum_elements_row(3) - sum_elements_column(3)
I tried to illustrate it in a picture with N = 4.
As a consequence the following is true:
N-1 operations (potential concurrent writes) will be performed on each result[i]
result[i] will have N-(i+1) writes from subtractions and i writes from additions
Each P[i][j] contributes a subtraction to result[j] and an addition to result[i]
This is where the main problems come into place:
Using one thread to compute each P and updating the result directly will result in multiple threads trying to write to the same result location (N-1 threads each).
Storing the whole matrix P for a subsequent reduction step on the other hand is very expensive in terms of memory consumption and therefore impossible for very large systems.
The idea of having a unique, shared result vector for each thread-block is impossible, too.
(N of 50k makes about 1.25 billion P elements and therefore [assuming a maximum number of 1024 threads per block] a minimal number of about 1.2 million blocks, consuming over 450 GiB of memory if each block has its own result array with 50k double elements.)
I think I could handle reduction for a more static behaviour but this problem is rather dynamic in terms of potential concurrent memory write-access.
(Or is it possible to handle it by some "basic" type of reduction?)
Adding some complications ...
Unfortunately, depending on (arbitrary user) input, which is independent of the initial values, some elements of P need to be skipped.
Let's assume we need to skip combinations P[6], P[14] and P[18]. Therefore we have 25 combinations left, which need to be calculated.
How to tell the kernel which values need to be skipped?
I came up with three approaches, each having notable downsides if N is very large (such as several tens of thousands of elements).
1. Store all combinations ...
... with their respective row and column index struct combo { size_t row,col; };, that need to be calculated in a vector<combo> and operate on this vector. (used by the current implementation)
std::vector<combo> elements;
// somehow fill
size_t const M = elements.size();
for (size_t i=0; i<M; ++i)
{
// do the necessary computations using elements[i].row and elements[i].col
}
This solution consumes lots of memory, since only "several" elements are skipped (possibly tens of thousands, which is not much compared to several billion in total), so nearly every combination must be stored. However, it avoids
indexation computations
finding of removed elements
for each element of P, which is the downside of the second approach.
2. Operate on all elements of P and find removed elements
If I want to operate on each element of P and avoid nested loops (which I couldn't reproduce very well in CUDA) I need to do something like this:
size_t M = (N*N-N)/2;
for (size_t i=0; i<M; ++i)
{
// calculate row and column indices from `i`
double tmp = sqrt(8.0*double(i+1))/2.0 + 0.5;
double row_d = floor(tmp);
size_t current_row = size_t(row_d);
// row `current_row` starts at linear index current_row*(current_row-1)/2
size_t current_col = i - current_row*(current_row-1)/2;
// check whether the current combo of row and col is not to be removed
if (!removes[current_row].exists(current_col))
{
// do the necessary computations using current_row and current_col
}
}
The vector removes is very small in contrast to the elements vector in the first example but the additional computations to obtain current_row, current_col and the if-branch are very inefficient.
(Remember we're still talking about billions of evaluations.)
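For reference, the mapping from the linear index back to a (row, col) pair can be written so that the square root only provides an initial estimate and integer arithmetic settles the final answer. This is a sketch under the numbering of the illustration above; RowCol and rowColFromIndex are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>

struct RowCol { std::size_t row, col; };

// Maps the linear index i of the strictly lower triangle (numbered row by
// row, as in the illustration) to its (row, col) pair.
RowCol rowColFromIndex(std::size_t i)
{
    // Row r starts at index r*(r-1)/2, so r is roughly (1 + sqrt(1 + 8i)) / 2.
    std::size_t row = static_cast<std::size_t>(
        (1.0 + std::sqrt(1.0 + 8.0 * static_cast<double>(i))) / 2.0);
    // Correct a possible off-by-one caused by floating-point rounding.
    while (row * (row - 1) / 2 > i) --row;
    while ((row + 1) * row / 2 <= i) ++row;
    return {row, i - row * (row - 1) / 2};
}
```

With this numbering, index 11 maps to row 5, column 1, which matches P[11] = P[5][1] from the layout section.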
3. Operate on all elements of P and remove elements afterwards
Another idea I had was to calculate all valid and invalid combinations independently.
But unfortunately, due to summation errors the following statement is true:
calc_non_skipped() != calc_all() - calc_skipped()
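The inequality is a consequence of floating-point rounding: when a small term is added to a much larger sum it can be lost entirely, and subtracting the large part afterwards does not bring it back. A deliberately extreme sketch, with made-up values and hypothetical function names standing in for the three sums:

```cpp
// The small term 1.0 is below the spacing of doubles near 1.0e17 (which is 16),
// so adding it to the large term changes nothing.
double calcAll()        { return 1.0e17 + 1.0; }  // rounds to 1.0e17
double calcSkipped()    { return 1.0e17; }        // the part to be removed
double calcNonSkipped() { return 1.0; }           // what should remain
```

Here calcAll() - calcSkipped() evaluates to 0.0 while calcNonSkipped() is 1.0. Real data is rarely this extreme, but the same effect accumulates over billions of terms.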
Is there a convenient, known, high performance way to get the desired results from the initial values?
I know that this question is rather complicated and perhaps limited in relevance. Nevertheless, I hope some illuminative answers will help me to solve my problems.
The current implementation
Currently this is implemented as CPU Code with OpenMP.
I first set up a vector of the above mentioned combos storing every P that needs to be computed and pass it to a parallel for loop.
Each thread is provided with a private result vector and a critical section at the end of the parallel region is used for a proper summation.
First, I was puzzled for a moment why (N**2 - N)/2 yielded 21 for N=7 rather than 28 ... but the indices run 0-7, so N=8, and there are 28 elements in P. Shouldn't try to answer questions like this so late in the day. :-)
But on to a potential solution: Do you need to keep the array P for any other purpose? If not, I think you can get the result you want with just two intermediate arrays, each of length N: one for the sum of the rows and one for the sum of the columns.
Here's a quick-and-dirty example of what I think you're trying to do (subroutine direct_approach()) and how to achieve the same result using the intermediate arrays (subroutine refined_approach()):
#include <cstdlib>
#include <cstdio>
const int N = 7;
const float input_values[N] = { 3.0F, 5.0F, 7.0F, 11.0F, 13.0F, 17.0F, 23.0F };
float P[N][N]; // Yes, I'm wasting half the array. This way I don't have to fuss with mapping the indices.
float result1[N] = { 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F };
float result2[N] = { 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F };
float f(float arg1, float arg2)
{
// Arbitrary computation
return (arg1 * arg2);
}
float compute_result(int index)
{
float row_sum = 0.0F;
float col_sum = 0.0F;
int row;
int col;
// Compute the row sum
for (col = (index + 1); col < N; col++)
{
row_sum += P[index][col];
}
// Compute the column sum
for (row = 0; row < index; row++)
{
col_sum += P[row][index];
}
return (row_sum - col_sum);
}
void direct_approach()
{
int row;
int col;
for (row = 0; row < N; row++)
{
for (col = (row + 1); col < N; col++)
{
P[row][col] = f(input_values[row], input_values[col]);
}
}
int index;
for (index = 0; index < N; index++)
{
result1[index] = compute_result(index);
}
}
void refined_approach()
{
float row_sums[N];
float col_sums[N];
int index;
// Initialize intermediate arrays
for (index = 0; index < N; index++)
{
row_sums[index] = 0.0F;
col_sums[index] = 0.0F;
}
// Compute the row and column sums
// This can be parallelized by computing row and column sums
// independently, instead of in nested loops.
int row;
int col;
for (row = 0; row < N; row++)
{
for (col = (row + 1); col < N; col++)
{
float computed = f(input_values[row], input_values[col]);
row_sums[row] += computed;
col_sums[col] += computed;
}
}
// Compute the result
for (index = 0; index < N; index++)
{
result2[index] = row_sums[index] - col_sums[index];
}
}
void print_result(int n, float * result)
{
int index;
for (index = 0; index < n; index++)
{
printf(" [%d]=%f\n", index, result[index]);
}
}
int main(int argc, char * * argv)
{
printf("Data reduction test\n");
direct_approach();
printf("Result 1:\n");
print_result(N, result1);
refined_approach();
printf("Result 2:\n");
print_result(N, result2);
return (0);
}
Parallelizing the computation is not so easy, since each intermediate value is a function of most of the inputs. You can compute the sums individually, but that would mean performing f(...) multiple times. The best suggestion I can think of for very large values of N is to use more intermediate arrays, computing subsets of the results, then summing the partial arrays to yield the final sums. I'd have to think about that one when I'm not so tired.
To cope with the skip issue: If it's a simple matter of "don't use input values x, y, and z", you can store x, y, and z in a do_not_use array and check for those values when looping to compute the sums. If the values to be skipped are some function of row and column, you can store those as pairs and check for the pairs.
Hope this gives you ideas for your solution!
Update, now that I'm awake: Dealing with "skip" depends a lot on what data needs to be skipped. Another possibility for the first case - "don't use input values x, y, and z" - a much faster solution for large data sets would be to add a level of indirection: create yet another array, this one of integer indices, and store only the indices of the good inputs. F'r instance, if invalid data is in inputs 2 and 5, the valid array would be:
int valid_indices[] = { 0, 1, 3, 4, 6 };
Iterate over the array valid_indices, and use those indices to retrieve the data from your input array to compute the result. On the other paw, if the values to skip depend on both indices of the P array, I don't see how you can avoid some kind of lookup.
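A rough sketch of that indirection, using the same upper-triangle convention as refined_approach() above. The name reduceValid is hypothetical, and f is only a placeholder (a product, as in the earlier listing) for whatever the real pair function is.

```cpp
#include <cstddef>
#include <vector>

// Placeholder for the real pair function; the product is only an assumption.
double f(double a, double b) { return a * b; }

// Computes result[i] = row sum - column sum using only the inputs listed in
// validIndices (assumed sorted ascending); entries of `result` that belong
// to skipped inputs stay 0.
std::vector<double> reduceValid(const std::vector<double>& input,
                                const std::vector<std::size_t>& validIndices)
{
    std::vector<double> result(input.size(), 0.0);
    for (std::size_t a = 0; a < validIndices.size(); ++a)
        for (std::size_t b = a + 1; b < validIndices.size(); ++b) {
            const std::size_t row = validIndices[a], col = validIndices[b];
            const double value = f(input[row], input[col]);
            result[row] += value;   // part of the row sum of `row`
            result[col] -= value;   // part of the column sum of `col`
        }
    return result;
}
```

The loops never touch the skipped inputs, so no per-element test is needed inside the inner loop.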
Back to parallelizing - No matter what, you'll be dealing with (N**2 - N)/2 computations
of f(). One possibility is to just accept that there will be contention for the sum
arrays, which would not be a big issue if computing f() takes substantially longer than
the two additions. When you get to very large numbers of parallel paths, contention will
again be an issue, but there should be a "sweet spot" balancing the number of parallel
paths against the time required to compute f().
If contention is still an issue, you can partition the problem several ways. One way is
to compute a row or column at a time: for a row at a time, each column sum can be
computed independently and a running total can be kept for each row sum.
Another approach would be to divide the data space and, thus, the computation into
subsets, where each subset has its own row and column sum arrays. After each block
is computed, the independent arrays can then be summed to produce the values you need
to compute the result.
This probably will be one of those naive and useless answers, but it also might help. Feel free to tell me that I'm utterly and completely wrong and I have misunderstood the whole affair.
So... here we go!
The Basic Problem
It seems to me that you can define your result function a little differently and it will lift at least some contention off your intermediate values. Let's suppose that your P matrix is lower-triangular. If you (virtually) fill the upper triangle with the negative of the lower values (and the main diagonal with all zeros), then you can redefine each element of your result as the sum of a single row (shown here for N=4, where -i means the negative of the value in the cell marked as i):
P 0 1 2 3
|--------------------
0| x -0 -1 -3
|
1| 0 x -2 -4
|
2| 1 2 x -5
|
3| 3 4 5 x
If you launch independent threads (executing the same kernel) to calculate the sum of each row of this matrix, each thread will write a single result element. It seems that your problem size is large enough to saturate your hardware threads and keep them busy.
The caveat, of course, is that you'll be calculating each f(x, y) twice. I don't know how expensive that is, or how much the memory contention was costing you before, so I cannot judge whether this is a worthwhile trade-off to do or not. But unless f was really really expensive, I think it might be.
Skipping Values
You mention that you might have tens of thousands elements of the P matrix that you need to ignore in your calculations (effectively skip them.)
To work with the scheme I've proposed above, I believe you should store the skipped elements as (row, col) pairs, and you have to add the transpose of each coordinate pair too (so you'll have twice the number of skipped values.) So your example skip list of P[6], P[14] and P[18] becomes P(4,0), P(5,4), P(6,3) which then becomes P(4,0), P(5,4), P(6,3), P(0,4), P(4,5), P(3,6).
Then you sort this list, first based on row and then column. This makes our list to be P(0,4), P(3,6), P(4,0), P(4,5), P(5,4), P(6,3).
If each row of your virtual P matrix is processed by one thread (or a single instance of your kernel or whatever,) you can pass it the values it needs to skip. Personally, I would store all these in a big 1D array and just pass in the first and last index that each thread would need to look at (I would also not store the row indices in the final array that I passed in, since it can be implicitly inferred, but I think that's obvious.) In the example above, for N = 8, the begin and end pairs passed to each thread will be: (note that the end is one past the final value needed to be processed, just like STL, so an empty list is denoted by begin == end)
Thread 0: 0..1
Thread 1: 1..1 (or 0..0 or whatever)
Thread 2: 1..1
Thread 3: 1..2
Thread 4: 2..4
Thread 5: 4..5
Thread 6: 5..6
Thread 7: 6..6
Now, each thread goes on to calculate and sum all the intermediate values in a row. While it is stepping through the indices of columns, it is also stepping through this list of skipped values and skipping any column number that comes up in the list. This is obviously an efficient and simple operation (since the list is sorted by column too. It's like merging.)
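The preparation of that list (mirror, sort, strip the row numbers, record begin/end per row) could look roughly like the following host-side sketch. SkipTable and buildSkipTable are hypothetical names; the offsets array holds the begin index of each row, with one extra entry so that row r uses the half-open range offsets[r]..offsets[r+1].

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct SkipTable {
    std::vector<unsigned> columns;     // column indices, grouped by row
    std::vector<std::size_t> offsets;  // row r uses columns[offsets[r] .. offsets[r+1])
};

// `skipped` holds (row, col) pairs taken from the lower triangle only.
SkipTable buildSkipTable(std::vector<std::pair<unsigned, unsigned>> skipped,
                         unsigned N)
{
    // Add the mirrored (col, row) entry for every pair.
    const std::size_t original = skipped.size();
    for (std::size_t k = 0; k < original; ++k) {
        const std::pair<unsigned, unsigned> rc = skipped[k];  // copy before growing
        skipped.emplace_back(rc.second, rc.first);
    }
    std::sort(skipped.begin(), skipped.end());   // by row, then by column

    SkipTable table;
    table.offsets.assign(N + 1, 0);
    for (const auto& rc : skipped) {
        table.columns.push_back(rc.second);
        ++table.offsets[rc.first + 1];           // count entries per row
    }
    for (unsigned r = 0; r < N; ++r)             // prefix sum: counts -> offsets
        table.offsets[r + 1] += table.offsets[r];
    return table;
}
```

For the example skip list and N = 8 this yields the ranges listed above, for instance 2..4 for thread 4.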
Pseudo-Implementation
I don't know CUDA, but I have some experience working with OpenCL, and I imagine the interfaces are similar (since the hardware they are targeting is the same.) Here's an implementation of the kernel that does the processing for a row (i.e. calculates one entry of result) in pseudo-C++:
double calc_one_result (
unsigned my_id, unsigned N, double const init_values [],
unsigned skip_indices [], unsigned skip_begin, unsigned skip_end
)
{
double res = 0;
for (unsigned col = 0; col < my_id; ++col)
// "f" seems to take init_values[column] as its first arg
res += f (init_values[col], init_values[my_id]);
for (unsigned row = my_id + 1; row < N; ++row)
res -= f (init_values[my_id], init_values[row]);
// At this point, "res" is holding "result[my_id]",
// including the values that should have been skipped
unsigned i = skip_begin;
// The second condition is to check whether we have reached the
// middle of the virtual matrix or not
for (; i < skip_end && skip_indices[i] < my_id; ++i)
{
unsigned col = skip_indices[i];
res -= f (init_values[col], init_values[my_id]);
}
for (; i < skip_end; ++i)
{
unsigned row = skip_indices[i];
res += f (init_values[my_id], init_values[row]);
}
return res;
}
Note the following:
The semantics of init_values and function f are as described by the question.
This function calculates one entry in the result array; specifically, it calculates result[my_id], so you should launch N instances of this.
The only shared variable it writes to is result[my_id]. Well, the above function doesn't write to anything, but if you translate it to CUDA, I imagine you'd have to write to that at the end. However, no one else writes to that particular element of result, so this write will not cause any contention or data race.
The two input arrays, init_values and skip_indices, are shared among all the running instances of this function.
All accesses to data are linear and sequential, except for the skipped values, which I believe is unavoidable.
skip_indices contains a list of indices that should be skipped in each row. Its contents and structure are as described above, with one small optimization. Since there was no need, I have removed the row numbers and left only the columns. The row number will be passed into the function as my_id anyway, and the slice of the skip_indices array that should be used by each invocation is determined using skip_begin and skip_end.
For the example above, the array that is passed into all invocations of calc_one_result will look like this: [4, 6, 0, 5, 4, 3].
As you can see, apart from the loops, the only conditional branch in this code is skip_indices[i] < my_id in the third for-loop. Although I believe this is innocuous and totally predictable, even this branch can be easily avoided in the code. We just need to pass in another parameter called skip_middle that tells us where the skipped items cross the main diagonal (i.e. for row #my_id, the index at skip_indices[skip_middle] is the first that is larger than my_id.)
In Conclusion
I'm by no means an expert in CUDA and HPC. But if I have understood your problem correctly, I think this method might eliminate any and all contentions for memory. Also, I don't think this will cause any (more) numerical stability issues.
The cost of implementing this is:
Calling f twice as many times in total (and keeping track of when it is called for row < col so you can multiply the result by -1.)
Storing twice as many items in the list of skipped values. Since the size of this list is in the thousands (and not billions!) it shouldn't be much of a problem.
Sorting the list of skipped values; which again due to its size, should be no problem.
(UPDATE: Added the Pseudo-Implementation section.)

Interesting algorithm to compare two matrices?

I have a problem: I have two matrices (vectors), one massive matrix and one smaller matrix. I have an algorithm that splits the massive matrix into blocks (of the size of the small matrix).
For example (I am using test data here), the massive matrix is 4x4 and the small matrix is 2x2. I pass the block at the current position to a function that checks whether the small matrix is equal to that block of the massive matrix; if it is, the function returns true, otherwise it returns false.
I can output each block like this:
bool compareMatrix(vector<double> &theMatrix1, vector<double> &theMatrix2, int startRow, int startCol)
{
// I can output the matrix blocks like this:
cout << theMatrix1[startRow*2+startCol] << endl;
}
But I don't quite understand how I would compare the block (at the starting row/col) to the small matrix.
How it would work is this:
Matrix 1: (4x4)
0 1 0 1
1 1 0 1
0 0 1 1
0 1 1 1
Matrix 2: (2x2)
0 1
0 1
I then split the blocks into 2x2:
B1 =
0 1
1 1
is B1 equal to theMatrix2 - No so return false
B2 =
0 1
0 1
is B2 equal to theMatrix2 - Yes so return true
I have really tried to explain things to the best of detail as I possibly can and hope someone can give me some advice because I've been working on it for so long now!
Thanks
If the size of the big matrix is known you can compare small parts of it with your 2x2 matrix like so
int bigMatrixSize=4;
bool compare(...)
{
for (int i=0; i<2; ++i)
for (int k=0; k<2; ++k)
if(bigMatrix[k+startX+(i+startY)*bigMatrixSize] != smallMatrix[i][k]) // i = row, k = column
return false;
return true;
}
I left out bounds checking and some other stuff, but it should give you an idea.
bool compareMatrix(vector<double> &theMatrix1, int nRow1, int nCol1, vector<double> &theMatrix2, int nRow2, int nCol2, int startRow, int startCol)
{
int p1 = startRow * nCol1 + startCol, p2 = 0;
for (int y = 0; y < nRow2; ++y)
{
for (int x = 0; x < nCol2; ++x)
{
if (theMatrix1[p1 + x] != theMatrix2[p2 + x]) // You can use memcmp here, but it's safer to let the compiler do the optimization.
{
return false;
}
}
p1 += nCol1;
p2 += nCol2;
}
return true;
}
You want something like this? You can add the columns count to the position to reach the next row.
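To tie this back to the block-by-block idea in the question, here is a self-contained sketch that steps through the non-overlapping blocks of a row-major std::vector<double> and reports the first one that equals the small matrix. The names blockEquals and firstMatchingBlock are made up for this example, and blocks are counted row by row starting at 0 (so B1 in the question is block 0 and B2 is block 1).

```cpp
#include <cstddef>
#include <vector>

// Compares the nRow2 x nCol2 matrix m2 with the block of m1 (nCol1 columns,
// row-major) whose top-left corner is (startRow, startCol).
bool blockEquals(const std::vector<double>& m1, std::size_t nCol1,
                 const std::vector<double>& m2, std::size_t nRow2, std::size_t nCol2,
                 std::size_t startRow, std::size_t startCol)
{
    for (std::size_t y = 0; y < nRow2; ++y)
        for (std::size_t x = 0; x < nCol2; ++x)
            if (m1[(startRow + y) * nCol1 + startCol + x] != m2[y * nCol2 + x])
                return false;
    return true;
}

// Returns the index of the first matching block, or -1 if no block matches.
// Partial blocks at the right and bottom edges are ignored.
int firstMatchingBlock(const std::vector<double>& m1, std::size_t nRow1, std::size_t nCol1,
                       const std::vector<double>& m2, std::size_t nRow2, std::size_t nCol2)
{
    if (nRow2 == 0 || nCol2 == 0)
        return -1;
    int block = 0;
    for (std::size_t r = 0; r + nRow2 <= nRow1; r += nRow2)
        for (std::size_t c = 0; c + nCol2 <= nCol1; c += nCol2, ++block)
            if (blockEquals(m1, nCol1, m2, nRow2, nCol2, r, c))
                return block;
    return -1;
}
```

With Matrix 1 and Matrix 2 from the question, the function returns 1, i.e. the block called B2 there. Stepping by the block size only finds matches aligned to the block grid; stepping r and c by 1 instead would find the small matrix at any position.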