Bit shift operation in parallel prefix sum - OpenGL

The code below, which computes a prefix sum in parallel, is from the OpenGL SuperBible (chapter 10).
The shader shown has a local workgroup size of 1024, which means it will process arrays of 2048 elements, as each invocation computes two elements of the output array. The shared variable shared_data is used to store the data that is in flight. When execution starts, the shader loads two adjacent elements from the input array into the shared array. Next, it executes the barrier() function. This step ensures that all of the shader invocations have loaded their data into the shared array before the inner loop begins.
#version 450 core

layout (local_size_x = 1024) in;

layout (binding = 0) coherent buffer block1
{
    float input_data[gl_WorkGroupSize.x];
};

layout (binding = 1) coherent buffer block2
{
    float output_data[gl_WorkGroupSize.x];
};

shared float shared_data[gl_WorkGroupSize.x * 2];

void main(void)
{
    uint id = gl_LocalInvocationID.x;
    uint rd_id;
    uint wr_id;
    uint mask;

    // The number of steps is the log base 2 of the
    // work group size, which should be a power of 2
    const uint steps = uint(log2(gl_WorkGroupSize.x)) + 1;
    uint step = 0;

    // Each invocation is responsible for the content of
    // two elements of the output array
    shared_data[id * 2] = input_data[id * 2];
    shared_data[id * 2 + 1] = input_data[id * 2 + 1];

    // Synchronize to make sure that everyone has initialized
    // their elements of shared_data[] with data loaded from
    // the input arrays
    barrier();
    memoryBarrierShared();

    // For each step...
    for (step = 0; step < steps; step++)
    {
        // Calculate the read and write index in the
        // shared array
        mask = (1 << step) - 1;
        rd_id = ((id >> step) << (step + 1)) + mask;
        wr_id = rd_id + 1 + (id & mask);

        // Accumulate the read data into our element
        shared_data[wr_id] += shared_data[rd_id];

        // Synchronize again to make sure that everyone
        // has caught up with us
        barrier();
        memoryBarrierShared();
    }

    // Finally write our data back to the output image
    output_data[id * 2] = shared_data[id * 2];
    output_data[id * 2 + 1] = shared_data[id * 2 + 1];
}
How can I understand the bit-shift calculations of rd_id and wr_id intuitively? Why do they work?

When we say something is "intuitive" we usually mean that our understanding is deep enough that we are not aware of our own thought processes, and "know the answer" without consciously thinking about it. Here the author is using the binary representation of integers within a CPU/GPU to make the code shorter and (probably) slightly faster. The code will only be "intuitive" for someone who is very familiar with such encodings and binary operations on integers. I'm not, so had to think about what is going on.
I would recommend working through this code, since these kinds of operations do occur in high-performance graphics and other programming. If you find it interesting, it will eventually become intuitive. If not, that's OK as long as you can figure things out when necessary.
One approach is to just copy this code into a C/C++ program and print out the mask, rd_id, wr_id, etc. You wouldn't actually need the data arrays, or the calls to barrier() and memoryBarrierShared(). Make up values for invocation ID and workgroup size based on what the SuperBible example does. That might be enough for "Aha! I see."
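For example, something like the following (an untested sketch using a toy workgroup size of 8, i.e. a 16-element shared array, instead of 1024) prints the indices each invocation would use at each step. Looking at the output, you can see that at every step rd_id points at the last, already-complete element of a block of 2^step values, and wr_id walks over the elements of the next block that this partial sum gets added into.

#include <cstdio>

int main()
{
    const unsigned workgroup_size = 8;     // stand-in for gl_WorkGroupSize.x = 1024
    const unsigned steps = 4;              // log2(workgroup_size) + 1

    for (unsigned step = 0; step < steps; ++step)
    {
        for (unsigned id = 0; id < workgroup_size; ++id)
        {
            unsigned mask  = (1u << step) - 1u;
            unsigned rd_id = ((id >> step) << (step + 1)) + mask;
            unsigned wr_id = rd_id + 1 + (id & mask);
            std::printf("step=%u id=%u mask=%u rd_id=%2u wr_id=%2u\n",
                        step, id, mask, rd_id, wr_id);
        }
        std::printf("\n");
    }
    return 0;
}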
If you aren't familiar with the << and >> shifts, I suggest writing some tiny programs and printing out the numbers that result. Python might actually be slightly easier, since
print("{:016b}".format(mask))
will show you the actual bits, whereas in C you can only print in hex.
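In C++ you can get a similar view of the bits with std::bitset, for instance:

#include <bitset>
#include <iostream>

int main()
{
    unsigned mask = (1u << 3) - 1u;
    std::cout << std::bitset<16>(mask) << "\n";   // prints 0000000000000111
}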
To get you started, log2 of a power of two gives the exponent, i.e. the position of its single set bit: log2(256) is 8, log2(4096) is 12, etc. (Don't take my word for it, write some code.)
x << n multiplies x by 2 to the power n, so x << 1 is x * 2, x << 2 is x * 4, and so on. x >> n divides x by 2 to the power n instead: x >> 1 is x / 2, x >> 2 is x / 4, and so on.
(Very important: this only holds for non-negative integers! Again, write some code to find out what happens.)
The mask calculation is interesting. Try
mask = (1 << step);
first and see what values come out. This is a common pattern for selecting an individual bit. Subtracting 1 instead produces a value with all the bits to the right of that bit set.
ANDing (the & operator) with a mask that has zeroes on the left and ones on the right is a faster way of computing an integer modulo a power of 2.
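A little check (just a sketch) makes the equivalence visible:

#include <cassert>
#include <cstdio>

int main()
{
    const unsigned step = 3;
    const unsigned mask = (1u << step) - 1u;           // binary 0111
    for (unsigned x = 0; x < 64; ++x)
        assert((x & mask) == (x % (1u << step)));      // AND with mask == modulo 8
    std::printf("x & 0x%x behaves like x %% %u\n", mask, 1u << step);
}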
Finally, the rd_id and wr_id array indexes need to start from base positions in the array, derived from the invocation ID and workgroup size, and increment according to the pattern explained in the SuperBible text.

Related

MPI_Op_create. Problem with user function for scalar product

I have a task to compute a scalar product using a user-defined function via MPI_Op_create and MPI_Reduce. I have written the function, but it works slightly incorrectly.
The problem itself, as it seems to me, is hidden in how invec and inoutvec are replaced after one process performs its operations. After the first operation is done correctly, inoutvec stays the same and invec is replaced by a vector filled with 0. Can somebody tell me where the error is hidden?
void func_for_scalar_mult(double *invec, double *inoutvec, int* len, MPI_Datatype *dtptr)
{
    for(int i = 0; i < (*len - 1) / 2; ++i) {
        inoutvec[*len - 1] += invec[i] * invec[(*len - 1) / 2 + i]; // + inoutvec[i] * inoutvec[(*len - 1) / 2 + i]
    }
}
P.S. The program works correctly when using MPI_SUM instead of the user function, so I conclude that the error is hidden in the block of code shown.
I must apologize for the misuse of terms, due to my lack of experience writing MPI programs. Saying "first operation", I meant exactly the situation when processes 1 and 2 combine. The vectors which I give to MPI_Reduce are filled with dim / (commSize - 1) numbers from the first vector and dim / (commSize - 1) numbers from the second vector (dim is the dimension of the space), plus the last element of invec and inoutvec, which, as I planned, should hold the partial scalar product of the first half of invec and the second half of invec. So the dimension of invec is 2 * dim / (commSize - 1).
You say you want to compute a "scalar product". You mean that you multiply all the scalars in a vector together? That's not what the MPI_Op is for: a reduction on an array is a pointwise reduction of components: reduce item 0 from all processes into location 0, reduce item 1 from all processes into location 1, et cetera. I'm not sure if you can somehow hack the reduction to do what you want.
Also, you talk about "the first operation". There is no such thing: the reduction is probably done in a treelike manner: processes 0 and 1 combine, and 2 and 3 combine, and then those partial results combine. So there are probably p/2 "first" operations.
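For reference, the usual way to compute a dot product with MPI is not to put the multiplication inside the reduction operation at all: each process multiplies and sums its own slice, and then a single scalar is reduced with MPI_SUM. A minimal sketch (with illustrative variable names, not the asker's code):

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Each process owns a local slice of the two vectors
    // (filled with dummy values here).
    const int local_n = 4;
    std::vector<double> x(local_n, rank + 1.0), y(local_n, 2.0);

    // Local partial dot product ...
    double local_dot = 0.0;
    for (int i = 0; i < local_n; ++i)
        local_dot += x[i] * y[i];

    // ... then reduce the single scalar across all processes.
    double global_dot = 0.0;
    MPI_Reduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("dot product = %f\n", global_dot);

    MPI_Finalize();
    return 0;
}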

Sparse matrix-dense vector multiplication with matrix known at compile time

I have a sparse matrix with only zeros and ones as entries (and, for example, with shape 32k x 64k and 0.01% non-zero entries and no patterns to exploit in terms of where the non-zero entries are). The matrix is known at compile time. I want to perform matrix-vector multiplication (modulo 2) with non-sparse vectors (not known at compile time) containing 50% ones and zeros. I want this to be efficient, in particular, I'm trying to make use of the fact that the matrix is known at compile time.
Storing the matrix in an efficient format (saving only the indices of the "ones") will always take a few Mbytes of memory and directly embedding the matrix into the executable seems like a good idea to me. My first idea was to just automatically generate the C++ code that just assigns all the result vector entries to the sum of the correct input entries. This looks like this:
constexpr std::size_t N = 64'000;
constexpr std::size_t M = 32'000;
template<typename Bit>
void multiply(const std::array<Bit, N> &in, std::array<Bit, M> &out) {
out[0] = (in[11200] + in[21960] + in[29430] + in[36850] + in[44352] + in[49019] + in[52014] + in[54585] + in[57077] + in[59238] + in[60360] + in[61120] + in[61867] + in[62608] + in[63352] ) % 2;
out[1] = (in[1] + in[11201] + in[21961] + in[29431] + in[36851] + in[44353] + in[49020] + in[52015] + in[54586] + in[57078] + in[59239] + in[60361] + in[61121] + in[61868] + in[62609] + in[63353] ) % 2;
out[2] = (in[11202] + in[21962] + in[29432] + in[36852] + in[44354] + in[49021] + in[52016] + in[54587] + in[57079] + in[59240] + in[60362] + in[61122] + in[61869] + in[62610] + in[63354] ) % 2;
out[3] = (in[56836] + in[11203] + in[21963] + in[29433] + in[36853] + in[44355] + in[49022] + in[52017] + in[54588] + in[57080] + in[59241] + in[60110] + in[61123] + in[61870] + in[62588] + in[63355] ) % 2;
// LOTS more of this...
out[31999] = (in[10208] + in[21245] + in[29208] + in[36797] + in[40359] + in[48193] + in[52009] + in[54545] + in[56941] + in[59093] + in[60255] + in[61025] + in[61779] + in[62309] + in[62616] + in[63858] ) % 2;
}
This does in fact work (it takes ages to compile). However, it actually seems to be very slow (more than 10x slower than the same sparse vector-matrix multiplication in Julia) and also blows up the executable size significantly more than I would have thought necessary. I tried this with both std::array and std::vector, and with the individual entries (represented as Bit) being bool, std::uint8_t and int, with no improvement worth mentioning. I also tried replacing the modulo and addition by XOR. In conclusion, this is a terrible idea. I'm not sure why though - is the sheer code size slowing it down that much? Does this kind of code rule out compiler optimization?
I haven't tried any alternatives yet. The next idea I have is storing the indices as compile-time constant arrays (still giving me huge .cpp files) and looping over them. Initially, I expected doing this would lead the compiler optimization to generate the same binary as from my automatically generated C++ code. Do you think this is worth trying (I guess I will try anyway on monday)?
Another idea would be to try storing the input (and maybe also output?) vector as packed bits and perform the calculation like that. I would expect one can't get around a lot of bit-shifting or and-operations and this would end up being slower and worse overall.
Do you have any other ideas on how this might be done?
I'm not sure why though - is the sheer codesize slowing it down that much?
The problem is that the executable is big, so the OS will fetch a lot of pages from your storage device. This process is very slow, and the processor will often stall waiting for data to be loaded. Even if the code is already in RAM (OS caching), it is still inefficient because the speed of RAM (latency + throughput) is quite bad compared to the caches. The main issue here is that each instruction is executed only once. If you reuse the function, the code needs to be fetched again, and if it is too big to fit in the cache, it will be reloaded from slow RAM. Thus, the overhead of loading the code is very high compared to its actual execution. To overcome this problem, you need small code with loops iterating over a fairly small amount of data.
Does this kind of code rule out compiler optimization?
This depends on the compiler, but most mainstream compilers (e.g. GCC or Clang) will optimize the code in the same way (hence the slow compilation time).
Do you think this is worth trying (I guess I will try anyway on monday)?
Yes, this solution is clearly better, especially if the indices are stored in a compact way. In your case, you can store them using a uint16_t type. All the indices can be put in one big buffer. The starting/ending position of the indices for each line can be specified in another buffer referencing the first one (or using pointers). This buffer can be loaded into memory once at the beginning of your application from a dedicated file, to reduce the size of the resulting program (and to avoid fetches from the storage device in a critical loop). With a probability of 0.01% of having non-zero values, the resulting data structure will take less than 500 KiB of RAM. On an average mainstream desktop processor, it can fit in the L3 cache (which is quite fast), and I think the computation should not take more than 1 ms, assuming the code of multiply is carefully optimized.
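A condensed sketch of that layout (with illustrative names; a rows-first arrangement here, while the example the asker later added is column-oriented):

#include <cstddef>
#include <cstdint>
#include <vector>

// All column indices of the ones, stored row after row, plus a buffer
// giving where each row's index range starts (a CSR-style layout).
struct PackedBinaryMatrix {
    std::vector<std::uint32_t> row_start;  // size = number of rows + 1
    std::vector<std::uint16_t> col_index;  // size = number of ones
};

// out[r] = sum of in[c] over the ones in row r, modulo 2
void multiply(const PackedBinaryMatrix& m,
              const std::vector<std::uint8_t>& in,
              std::vector<std::uint8_t>& out)
{
    for (std::size_t r = 0; r + 1 < m.row_start.size(); ++r) {
        std::uint8_t acc = 0;
        for (std::uint32_t k = m.row_start[r]; k < m.row_start[r + 1]; ++k)
            acc ^= in[m.col_index[k]];     // addition mod 2 is XOR
        out[r] = acc;
    }
}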
Another idea would be to try storing the input (and maybe also output?) vector as packed bits and perform the calculation like that.
Bit-packing is good only if your matrix is not too sparse. With a matrix filled with 50% of non-zero values, the bit-packing method is great. With 0.01% of non-zero values, the bit-packing method is clearly bad as it takes too much space.
I would expect one can't get around a lot of bit-shifting or and-operations and this would end up being slower and worse overall.
As previously said, loading data from the storage device or the RAM is very slow. Doing some bit-shifts is very fast on any modern mainstream processor (and much much faster than loading data).
Here are approximate timings for various operations that a computer can do: [latency table not reproduced here]
I implemented the second method (constexpr arrays storing the matrix in compressed column storage format) and it is a lot better. It takes (for a 64'000 x 22'000 binary matrix containing 35'000 ones) <1min to compile with -O3 and performs one multiplication in <300 microseconds on my laptop (Julia takes around 350 microseconds for the same calculation). The total executable size is ~1 Mbyte.
Probably one can still do a lot better. If anyone has an idea, let me know!
Below is a code example (showing a 5x10 matrix) illustrating what I did.
#include <iostream>
#include <array>
#include <cstdint>

// Compressed sparse column storage for binary matrix
constexpr std::size_t M = 5;
constexpr std::size_t N = 10;
constexpr std::size_t num_nz = 5;

constexpr std::array<std::uint16_t, N + 1> colptr = {
    0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5
};
constexpr std::array<std::uint16_t, num_nz> row_idx = {
    0x0, 0x1, 0x2, 0x3, 0x4
};

template<typename Bit>
constexpr void encode(const std::array<Bit, N>& in, std::array<Bit, M>& out) {
    for (std::size_t col = 0; col < N; col++) {
        for (std::size_t j = colptr[col]; j < colptr[col + 1]; j++) {
            // Accumulate modulo 2: XOR, expressed as inequality of bools
            out[row_idx[j]] = (static_cast<bool>(out[row_idx[j]]) != static_cast<bool>(in[col]));
        }
    }
}

int main() {
    using Bit = bool;
    std::array<Bit, N> input{1, 0, 1, 0, 1, 1, 0, 1, 0, 1};
    std::array<Bit, M> output{};
    for (auto i : input) std::cout << i;
    std::cout << std::endl;
    encode(input, output);
    for (auto i : output) std::cout << i;
}

More calls to mersenne_twister than there are supposed to be

I have a peculiar problem with my current code. I'm writing a program that needs to generate random real numbers from two distributions (a normal distribution and a uniform real one). The code to generate these values lives inside a for loop:
char* buffer = new char[config.number_of_value * config.sizeof_line()];
//...
//Loop over how much values we want
for(std::size_t i = 0; i < config.number_of_value; ++i)
{
    //Calculates the offset where the current line begins (0, sizeof_line * 1, sizeof_line * 2, etc.)
    std::size_t line_offset = config.sizeof_line() * i;
    //The actual numbers we want to output to the file
    double x = next_uniform_real();
    double y = config.y_intercept + config.slope * x + next_normal_real();
    //Res is the number of character written. The character at buffer[res] is '\0', so we need
    //To get rid of it
    int res = sprintf((buffer + line_offset), "%f", x);
    buffer[line_offset + res] = '0';
    //Since we written double_rep_size character, we put the delimiter at double_rep_size index
    res = sprintf((buffer + line_offset + config.data_point_character_size() + sizeof(char)), "%f", y);
    buffer[line_offset + config.data_point_character_size() + sizeof(char) + res] = '0';
}
return buffer;
When running the program, the usual value of number_of_value is 100'000. So there should be 100'000 calls to next_uniform_real() and 100'000 calls to next_normal_real(). The strange part is, when I profile this code with VSPerf on Visual Studio 2017, I get 227'242 calls to the mersenne_twister generator, which would be 113'621 calls for each function. As you can see, there are 13'621 more calls than there are supposed to be.
Can anyone help me figure this out?
For reference, the functions look like this :
double generator::next_uniform_real()
{
    return uniform_real_dist(eng);
}

double generator::next_normal_real()
{
    return normal_dist(eng);
}
Where eng is std::mt19937, seeded with a random_device or time(0) when random_device has no entropy.
normal_dist is of type std::normal_distribution<>
and uniform_real_dist is of type std::uniform_real_distribution<>
For those wondering, I'm filling up a char* buffer so that I can make one single write to an ostream rather than one for each iteration of the loop.
(As an aside, if someone knows a faster way to write float/double values to char* or a faster way to generate real numbers than this method, that'd be really helpful!)
All major standard library implementations of std::normal_distribution use the Marsaglia polar method. As noted in the Wikipedia article,
this procedure requires about 27% more evaluations of the underlying random number generator (only π/4 ≈ 79% of generated points lie inside of unit circle).
Your number sounds about right (100000 uniform reals at 1 RNG call per number plus 100000 normal reals at 1.27 RNG calls per number is 227000).
Imagine if you're trying to generate a random integer between 1 and 10 inclusive and your input source provides a random number between 1 and 12 inclusive. If you get a number between 1 and 10, you can just output it. But if you get an 11, you must get another number between 1 and 12. So extra calls may be needed when matching a random source to a random output with a different distribution.
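If you want to see the effect yourself, a small counting wrapper around the engine makes the extra draws visible (just a sketch; the exact counts depend on the standard library implementation, since a single uniform double may itself consume more than one 32-bit draw from the engine):

#include <cstdint>
#include <iostream>
#include <random>

// Minimal engine wrapper that counts how often the underlying mt19937
// is invoked by the distributions.
struct counting_mt19937 {
    using result_type = std::mt19937::result_type;
    static constexpr result_type min() { return std::mt19937::min(); }
    static constexpr result_type max() { return std::mt19937::max(); }
    result_type operator()() { ++calls; return eng(); }

    std::mt19937 eng{42};
    std::uint64_t calls = 0;
};

int main()
{
    counting_mt19937 gen;
    std::uniform_real_distribution<> uniform;
    std::normal_distribution<> normal;

    const int n = 100000;
    for (int i = 0; i < n; ++i) uniform(gen);
    const std::uint64_t uniform_calls = gen.calls;
    for (int i = 0; i < n; ++i) normal(gen);
    const std::uint64_t normal_calls = gen.calls - uniform_calls;

    std::cout << "engine calls for " << n << " uniforms: " << uniform_calls << "\n"
              << "engine calls for " << n << " normals:  " << normal_calls << "\n";
}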

Efficient layout and reduction of virtual 2d data (abstract)

I use C++ and CUDA/C, want to write code for a specific problem, and ran into a quite tricky reduction problem.
My experience in parallel programming isn't negligible but quite limited, and I cannot totally foresee the specifics of this problem.
I doubt there is a convenient or even "easy" way to handle the problems I am facing, but perhaps I am wrong.
If there are any resources (i.e. articles, books, web links, ...) or keywords covering this or similar problems, please let me know.
I tried to generalize the whole case as well as possible and to keep it abstract instead of posting too much code.
The Layout ...
I have a system of N initial elements and N result elements. (I'll use N = 8 as an example, but N can be any integral value greater than three.)
static size_t const N = 8;
double init_values[N], result[N];
I need to calculate almost every (not all, I'm afraid) unique combination of the init values, without self-interference.
This means calculating f(init_values[0], init_values[1]), f(init_values[0], init_values[2]), ..., f(init_values[0], init_values[N-1]), f(init_values[1], init_values[2]), ..., f(init_values[1], init_values[N-1]), ... and so on.
This is in fact a virtual triangular matrix which has the shape seen in the following illustration.
P 0 1 2 3 4 5 6 7
|---------------------------------------
0| x
|
1| 0 x
|
2| 1 2 x
|
3| 3 4 5 x
|
4| 6 7 8 9 x
|
5| 10 11 12 13 14 x
|
6| 15 16 17 18 19 20 x
|
7| 21 22 23 24 25 26 27 x
Each element is a function of the respective column and row elements in init_values.
P[i] (= P[row(i)][col(i)]) = f(init_values[col(i)], init_values[row(i)])
i.e.
P[11] (= P[5][1]) = f(init_values[1], init_values[5])
There are (N*N-N)/2 = 28 possible, unique combinations (Note: P[1][5]==P[5][1], so we only have a lower (or upper) triangular matrix) using the example N = 8.
The basic problem
The result array is computed from P as a sum of the row elements minus the sum of the respective column elements.
For example the result at position 3 will be calculated as a sum of row 3 minus the sum of column three.
result[3] = (P[3]+P[4]+P[5]) - (P[9]+P[13]+P[18]+P[24])
result[3] = sum_elements_row(3) - sum_elements_column(3)
I tried to illustrate it in a picture with N = 4.
As a consequence the following is true:
N-1 operations (potential concurrent writes) will be performed on each result[i]
result[i] will have N-(i+1) writes from subtractions and i additions
Outgoing from each P[i][j] there will be a subtraction from r[j] and an addition to r[i]
This is where the main problems come into place:
Using one thread to compute each P and updating the result directly will result in multiple threads trying to write to the same result location (N-1 threads for each location).
Storing the whole matrix P for a subsequent reduction step on the other hand is very expensive in terms of memory consumption and therefore impossible for very large systems.
The idea of having a unique, shared result vector for each thread block is impossible, too.
(N of 50k makes 2.5 billion P elements and therefore [assuming a maximum number of 1024 threads per block] a minimal number of 2.4 million blocks consuming over 900GiB of memory if each block has its own result array with 50k double elements.)
I think I could handle reduction for a more static behaviour but this problem is rather dynamic in terms of potential concurrent memory write-access.
(Or is it possible to handle it by some "basic" type of reduction?)
Adding some complications ...
Unfortunately, depending on (arbitrary user) input, which is independent of the initial values, some elements of P need to be skipped.
Let's assume we need to skip permutations P[6], P[14] and P[18]. Therefore we have 24 combinations left, which need to be calculated.
How to tell the kernel which values need to be skipped?
I came up with three approaches, each having notable downsides if N is very large (like several ten thousands of elements).
1. Store all combinations ...
... with their respective row and column indices (struct combo { size_t row, col; };) that need to be calculated, in a vector<combo>, and operate on this vector. (This is what the current implementation uses.)
std::vector<combo> elements;
// somehow fill
size_t const M = elements.size();
for (size_t i=0; i<M; ++i)
{
// do the necessary computations using elements[i].row and elements[i].col
}
This solution is consuming lots of memory, since only "several" combinations are skipped (it may even be tens of thousands of elements, but that's not much in contrast to several billion in total), but it avoids
indexation computations
finding of removed elements
for each element of P, which is the downside of the second approach.
2. Operate on all elements of P and find removed elements
If I want to operate on each element of P and avoid nested loops (which I couldn't reproduce very well in CUDA), I need to do something like this:
size_t M = (N*N-N)/2;
for (size_t i = 0; i < M; ++i)
{
    // calculate row and column indices from `i`
    double tmp = sqrt(8.0*double(i+1))/2.0 + 0.5;
    double row_d = floor(tmp);
    size_t current_row = size_t(row_d);
    // row r starts at linear index r*(r-1)/2
    size_t current_col = i - (current_row*(current_row-1))/2;
    // check whether the current combo of row and col is not to be removed
    if (!removes[current_row].exists(current_col))
    {
        // do the necessary computations using current_row and current_col
    }
}
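A quick standalone check of this index math (for the N = 8 example above) could look like this:

#include <cassert>
#include <cmath>
#include <cstddef>

int main()
{
    const std::size_t N = 8;                      // as in the example above
    std::size_t i = 0;
    for (std::size_t row = 1; row < N; ++row)     // row 0 holds no element
        for (std::size_t col = 0; col < row; ++col, ++i)
        {
            double tmp = std::sqrt(8.0 * double(i + 1)) / 2.0 + 0.5;
            std::size_t current_row = std::size_t(std::floor(tmp));
            std::size_t current_col = i - (current_row * (current_row - 1)) / 2;
            assert(current_row == row && current_col == col);
        }
    return 0;
}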
The vector removes is very small in contrast to the elements vector in the first example but the additional computations to obtain current_row, current_col and the if-branch are very inefficient.
(Remember we're still talking about billions of evaluations.)
3. Operate on all elements of P and remove elements afterwards
Another idea I had was to calculate the contributions of all combinations and of the skipped (invalid) ones independently, and subtract the latter.
But unfortunately, due to summation errors the following statement is true:
calc_non_skipped() != calc_all() - calc_skipped()
Is there a convenient, known, high performance way to get the desired results from the initial values?
I know that this question is rather complicated and perhaps limited in relevance. Nevertheless, I hope some illuminative answers will help me to solve my problems.
The current implementation
Currently this is implemented as CPU code with OpenMP.
I first set up a vector of the above-mentioned combos, storing every P that needs to be computed, and pass it to a parallel for loop.
Each thread is provided with a private result vector and a critical section at the end of the parallel region is used for a proper summation.
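Roughly, the structure looks like this (a simplified sketch with a placeholder f, not the actual code):

#include <cstddef>
#include <vector>

struct combo { std::size_t row, col; };

// placeholder for the real pair interaction
double f(double a, double b) { return a * b; }

// Every thread accumulates into a private result vector; a critical
// section merges the private vectors at the end of the parallel region.
std::vector<double> reduce_combos(const std::vector<double>& init_values,
                                  const std::vector<combo>& elements)
{
    const std::size_t N = init_values.size();
    std::vector<double> result(N, 0.0);

    #pragma omp parallel
    {
        std::vector<double> local(N, 0.0);     // private result vector

        #pragma omp for
        for (std::ptrdiff_t k = 0; k < (std::ptrdiff_t)elements.size(); ++k)
        {
            const combo& c = elements[k];
            const double v = f(init_values[c.col], init_values[c.row]);
            local[c.row] += v;                 // contributes to the row sum
            local[c.col] -= v;                 // contributes to the column sum
        }

        #pragma omp critical
        for (std::size_t i = 0; i < N; ++i)
            result[i] += local[i];
    }
    return result;
}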
First, I was puzzled for a moment why (N**2 - N)/2 yielded 27 for N=7 ... but for indices 0-7, N=8, and there are 28 elements in P. Shouldn't try to answer questions like this so late in the day. :-)
But on to a potential solution: Do you need to keep the array P for any other purpose? If not, I think you can get the result you want with just two intermediate arrays, each of length N: one for the sum of the rows and one for the sum of the columns.
Here's a quick-and-dirty example of what I think you're trying to do (subroutine direct_approach()) and how to achieve the same result using the intermediate arrays (subroutine refined_approach()):
#include <cstdlib>
#include <cstdio>

const int N = 7;
const float input_values[N] = { 3.0F, 5.0F, 7.0F, 11.0F, 13.0F, 17.0F, 23.0F };

float P[N][N]; // Yes, I'm wasting half the array. This way I don't have to fuss with mapping the indices.
float result1[N] = { 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F };
float result2[N] = { 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F };

float f(float arg1, float arg2)
{
    // Arbitrary computation
    return (arg1 * arg2);
}

float compute_result(int index)
{
    float row_sum = 0.0F;
    float col_sum = 0.0F;
    int row;
    int col;

    // Compute the row sum
    for (col = (index + 1); col < N; col++)
    {
        row_sum += P[index][col];
    }

    // Compute the column sum
    for (row = 0; row < index; row++)
    {
        col_sum += P[row][index];
    }

    return (row_sum - col_sum);
}

void direct_approach()
{
    int row;
    int col;
    for (row = 0; row < N; row++)
    {
        for (col = (row + 1); col < N; col++)
        {
            P[row][col] = f(input_values[row], input_values[col]);
        }
    }

    int index;
    for (index = 0; index < N; index++)
    {
        result1[index] = compute_result(index);
    }
}

void refined_approach()
{
    float row_sums[N];
    float col_sums[N];
    int index;

    // Initialize intermediate arrays
    for (index = 0; index < N; index++)
    {
        row_sums[index] = 0.0F;
        col_sums[index] = 0.0F;
    }

    // Compute the row and column sums
    // This can be parallelized by computing row and column sums
    // independently, instead of in nested loops.
    int row;
    int col;
    for (row = 0; row < N; row++)
    {
        for (col = (row + 1); col < N; col++)
        {
            float computed = f(input_values[row], input_values[col]);
            row_sums[row] += computed;
            col_sums[col] += computed;
        }
    }

    // Compute the result
    for (index = 0; index < N; index++)
    {
        result2[index] = row_sums[index] - col_sums[index];
    }
}

void print_result(int n, float * result)
{
    int index;
    for (index = 0; index < n; index++)
    {
        printf(" [%d]=%f\n", index, result[index]);
    }
}

int main(int argc, char * * argv)
{
    printf("Data reduction test\n");
    direct_approach();
    printf("Result 1:\n");
    print_result(N, result1);
    refined_approach();
    printf("Result 2:\n");
    print_result(N, result2);
    return (0);
}
Parallelizing the computation is not so easy, since each intermediate value is a function of most of the inputs. You can compute the sums individually, but that would mean performing f(...) multiple times. The best suggestion I can think of for very large values of N is to use more intermediate arrays, computing subsets of the results, then summing the partial arrays to yield the final sums. I'd have to think about that one when I'm not so tired.
To cope with the skip issue: If it's a simple matter of "don't use input values x, y, and z", you can store x, y, and z in a do_not_use array and check for those values when looping to compute the sums. If the values to be skipped are some function of row and column, you can store those as pairs and check for the pairs.
Hope this gives you ideas for your solution!
Update, now that I'm awake: Dealing with "skip" depends a lot on what data needs to be skipped. Another possibility for the first case - "don't use input values x, y, and z" - a much faster solution for large data sets would be to add a level of indirection: create yet another array, this one of integer indices, and store only the indices of the good inputs. F'r instance, if invalid data is in inputs 2 and 5, the valid array would be:
int valid_indices[] = { 0, 1, 3, 4, 6 };
Iterate over the array valid_indices, and use those indices to retrieve the data from your input array to compute the result. On the other paw, if the values to skip depend on both indices of the P array, I don't see how you can avoid some kind of lookup.
Back to parallelizing - No matter what, you'll be dealing with (N**2 - N)/2 computations of f(). One possibility is to just accept that there will be contention for the sum arrays, which would not be a big issue if computing f() takes substantially longer than the two additions. When you get to very large numbers of parallel paths, contention will again be an issue, but there should be a "sweet spot" balancing the number of parallel paths against the time required to compute f().
If contention is still an issue, you can partition the problem several ways. One way is to compute a row or column at a time: for a row at a time, each column sum can be computed independently and a running total can be kept for each row sum.
Another approach would be to divide the data space and, thus, the computation into subsets, where each subset has its own row and column sum arrays. After each block is computed, the independent arrays can then be summed to produce the values you need to compute the result.
This probably will be one of those naive and useless answers, but it also might help. Feel free to tell me that I'm utterly and completely wrong and I have misunderstood the whole affair.
So... here we go!
The Basic Problem
It seems to me that you can define your result function a little differently, and it will lift at least some contention off your intermediate values. Let's suppose that your P matrix is lower-triangular. If you (virtually) fill the upper triangle with the negative of the lower values (and the main diagonal with all zeros), then you can redefine each element of your result as the sum of a single row: (shown here for N=4, and where -i means the negative of the value in the cell marked as i)
P 0 1 2 3
|--------------------
0| x -0 -1 -3
|
1| 0 x -2 -4
|
2| 1 2 x -5
|
3| 3 4 5 x
If you launch independent threads (executing the same kernel) to calculate the sum of each row of this matrix, each thread will write a single result element. It seems that your problem size is large enough to saturate your hardware threads and keep them busy.
The caveat, of course, is that you'll be calculating each f(x, y) twice. I don't know how expensive that is, or how much the memory contention was costing you before, so I cannot judge whether this is a worthwhile trade-off to do or not. But unless f was really really expensive, I think it might be.
Skipping Values
You mention that you might have tens of thousands of elements of the P matrix that you need to ignore in your calculations (effectively skip them).
To work with the scheme I've proposed above, I believe you should store the skipped elements as (row, col) pairs, and you have to add the transpose of each coordinate pair too (so you'll have twice the number of skipped values). So your example skip list of P[6], P[14] and P[18] becomes P(4,0), P(5,4), P(6,3), which then becomes P(4,0), P(5,4), P(6,3), P(0,4), P(4,5), P(3,6).
Then you sort this list, first based on row and then column. Our list then becomes P(0,4), P(3,6), P(4,0), P(4,5), P(5,4), P(6,3).
If each row of your virtual P matrix is processed by one thread (or a single instance of your kernel or whatever,) you can pass it the values it needs to skip. Personally, I would store all these in a big 1D array and just pass in the first and last index that each thread would need to look at (I would also not store the row indices in the final array that I passed in, since it can be implicitly inferred, but I think that's obvious.) In the example above, for N = 8, the begin and end pairs passed to each thread will be: (note that the end is one past the final value needed to be processed, just like STL, so an empty list is denoted by begin == end)
Thread 0: 0..1
Thread 1: 1..1 (or 0..0 or whatever)
Thread 2: 1..1
Thread 3: 1..2
Thread 4: 2..4
Thread 5: 4..5
Thread 6: 5..6
Thread 7: 6..6
Now, each thread goes on to calculate and sum all the intermediate values in a row. While it is stepping through the indices of columns, it is also stepping through this list of skipped values and skipping any column number that comes up in the list. This is obviously an efficient and simple operation (since the list is sorted by column too. It's like merging.)
Pseudo-Implementation
I don't know CUDA, but I have some experience working with OpenCL, and I imagine the interfaces are similar (since the hardware they are targeting are the same.) Here's an implementation of the kernel that does the processing for a row (i.e. calculates one entry of result) in pseudo-C++:
double calc_one_result (
    unsigned my_id, unsigned N, double const init_values [],
    unsigned skip_indices [], unsigned skip_begin, unsigned skip_end
    )
{
    double res = 0;

    for (unsigned col = 0; col < my_id; ++col)
        // "f" seems to take init_values[column] as its first arg
        res += f (init_values[col], init_values[my_id]);

    for (unsigned row = my_id + 1; row < N; ++row)
        res -= f (init_values[my_id], init_values[row]);

    // At this point, "res" is holding "result[my_id]",
    // including the values that should have been skipped

    unsigned i = skip_begin;

    // The second condition is to check whether we have reached the
    // middle of the virtual matrix or not
    for (; i < skip_end && skip_indices[i] < my_id; ++i)
    {
        unsigned col = skip_indices[i];
        res -= f (init_values[col], init_values[my_id]);
    }

    for (; i < skip_end; ++i)
    {
        unsigned row = skip_indices[i];
        res += f (init_values[my_id], init_values[row]);
    }

    return res;
}
Note the following:
The semantics of init_values and function f are as described by the question.
This function calculates one entry in the result array; specifically, it calculates result[my_id], so you should launch N instances of this.
The only shared variable it writes to is result[my_id]. Well, the above function doesn't write to anything, but if you translate it to CUDA, I imagine you'd have to write to that at the end. However, no one else writes to that particular element of result, so this write will not cause any contention of data race.
The two input arrays, init_values and skip_indices, are shared among all the running instances of this function.
All accesses to data are linear and sequential, except for the skipped values, which I believe is unavoidable.
skip_indices contains a list of indices that should be skipped in each row. Its contents and structure are as described above, with one small optimization. Since there was no need, I have removed the row numbers and left only the columns. The row number will be passed into the function as my_id anyway, and the slice of the skip_indices array that should be used by each invocation is determined using skip_begin and skip_end.
For the example above, the array that is passed into all invocations of calc_one_result will look like this: [4, 6, 0, 5, 4, 3].
As you can see, apart from the loops, the only conditional branch in this code is skip_indices[i] < my_id in the third for-loop. Although I believe this is innocuous and totally predictable, even this branch can be easily avoided in the code. We just need to pass in another parameter called skip_middle that tells us where the skipped items cross the main diagonal (i.e. for row #my_id, the index at skipped_indices[skip_middle] is the first that is larger than my_id.)
In Conclusion
I'm by no means an expert in CUDA and HPC. But if I have understood your problem correctly, I think this method might eliminate any and all contentions for memory. Also, I don't think this will cause any (more) numerical stability issues.
The cost of implementing this is:
Calling f twice as many times in total (and keeping track of when it is called for row < col so you can multiply the result by -1.)
Storing twice as many items in the list of skipped values. Since the size of this list is in the thousands (and not billions!) it shouldn't be much of a problem.
Sorting the list of skipped values; which again due to its size, should be no problem.
(UPDATE: Added the Pseudo-Implementation section.)

Efficiently Building Summed Area Table

I am trying to construct a summed area table for later use in an adaptive thresholding routine. Since this code is going to be used in time critical software, I am trying to squeeze as many cycles as possible out of it.
For performance, the table is unsigned integers for every pixel.
When I attach my profiler, I am showing that my largest performance bottleneck occurs when performing the x-pass.
The simple math expression for the computation is:
sat_[y * width + x] = sat_[y * width + x - 1] + buff_[y * width + x]
where the running sum resets at every new y position.
In this case, sat_ is a 1-D pointer of unsigned integers representing the SAT, and buff_ is an 8-bit unsigned monochrome buffer.
My implementation looks like the following:
uint *pSat = sat_;
char *pBuff = buff_;
for (size_t y = 0; y < height; ++y, pSat += width, pBuff += width)
{
    uint curr = 0;
    for (uint x = 0; x < width; x += 4)
    {
        pSat[x + 0] = curr += pBuff[x + 0];
        pSat[x + 1] = curr += pBuff[x + 1];
        pSat[x + 2] = curr += pBuff[x + 2];
        pSat[x + 3] = curr += pBuff[x + 3];
    }
}
The loop is unrolled manually because my compiler (VC11) didn't do it for me. The problem I have is that the entire segmentation routine is spending an extraordinary amount of time just running through that loop, and I am wondering if anyone has any thoughts on what might speed it up. I have access to all of the SSE's sets, and AVX for any machine this routine will run on, so if there is something there, that would be extremely useful.
Also, once I squeeze out the last cycles, I then plan on extending this to multi-core, but I want to get the single thread computation as tight as possible before I make the model more complex.
You have a dependency chain running along each row; each result depends on the previous one. So you cannot vectorise/parallelise in that direction.
But, it sounds like each row is independent of all the others, so you can vectorise/paralellise by computing multiple rows simultaneously. You'd need to transpose your arrays, in order to allow the vector instructions to access neighbouring elements in memory.*
However, that creates a problem. Walking along rows would now be absolutely terrible from a cache point of view (every iteration would be a cache miss). The way to solve this is to interchange the loop order.
Note, though, that each element is read precisely once. And you're doing very little computation per element. So you'll basically be limited by main-memory bandwidth well before you hit 100% CPU usage.
* This restriction may be lifted in AVX2, I'm not sure...
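A sketch of that loop structure (assuming the buffer and the SAT are stored transposed, so that consecutive y values for a fixed x are adjacent in memory; names are illustrative, not the asker's code):

#include <cstddef>

// x-pass on transposed storage: element (x, y) lives at index x * height + y.
// The inner loop over y carries no dependency, so the compiler can vectorize it;
// the serial dependency (on column x - 1) sits in the outer loop.
void x_pass_transposed(const unsigned char* buff_t, unsigned int* sat_t,
                       std::size_t width, std::size_t height)
{
    // x == 0: start each row's running sum
    for (std::size_t y = 0; y < height; ++y)
        sat_t[y] = buff_t[y];

    for (std::size_t x = 1; x < width; ++x)
    {
        const unsigned int*  prev = sat_t  + (x - 1) * height;
        unsigned int*        curr = sat_t  +  x      * height;
        const unsigned char* src  = buff_t +  x      * height;
        for (std::size_t y = 0; y < height; ++y)   // independent per row
            curr[y] = prev[y] + src[y];
    }
}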
Algorithmically, I don't think there is anything you can do to optimize this further. Even though you didn't use the term OLAP cube in your description, you are basically just building an OLAP cube. The code you have is the standard approach to building an OLAP cube.
If you give details about the hardware you're working with, there might be some optimizations available. For example, there is a GPU programming approach that may or may not be faster. Note: Another post on this thread mentioned that parallelization is not possible. This isn't necessarily true... Your algorithm can't be implemented in parallel, but there are algorithms that maintain data-level parallelism, which could be exploited with a GPU approach.