What is the most efficient way to repeat elements in a vector and apply a set of different functions across all elements using Eigen? - c++

Say I have a vector containing only positive, real elements defined like this:
Eigen::VectorXd v(5);
v << 1.3876, 8.6983, 5.438, 3.9865, 4.5673;
I want to generate a new vector v2 that repeats each element of v some k times. Then I want to apply a different function to each of the k copies of every element.
For example, if v2 were v repeated 2 times and I applied floor() and ceil() as my two functions, the result based on the above vector would be a column vector with values: [1; 2; 8; 9; 5; 6; 3; 4; 4; 5]. Preserving the order of the original values is important here as well. These values are also a simplified example; in practice, I'm generating vectors v with ~100,000 or more elements and would like to make my code as vectorizable as possible.
Since I'm coming to Eigen and C++ from Matlab, the simplest approach I first took was to just convert this Nx1 vector into an Nx2 matrix, apply floor to the first column and ceil to the second column, take the transpose to get a 2xN matrix and then exploit the column-major nature of the matrix and reshape the 2xN matrix into a 2Nx1 vector, yielding the result I want. However, for large vectors, this would be very slow and inefficient.
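For concreteness, here is a minimal sketch of that reshape approach, assuming Eigen 3.4 for reshaped(); the intermediate matrix, transpose, and copy are exactly the overhead in question:
#include <Eigen/Dense>

// A sketch of the Matlab-style approach described above (assumes Eigen 3.4).
Eigen::VectorXd repeatAndApply(const Eigen::VectorXd& v)
{
    Eigen::MatrixX2d m(v.size(), 2);
    m.col(0) = v.array().floor();
    m.col(1) = v.array().ceil();
    Eigen::Matrix2Xd t = m.transpose(); // 2xN
    return t.reshaped(); // column-major read yields the interleaved 2Nx1 vector
}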
This response by ggael effectively addresses how I could repeat the elements in the input vector by generating a sequence of indices and indexing the input vector. I could then generate more sequences of indices to apply my functions to the relevant elements of v2 and copy the results back to their respective places. However, is this really the most efficient approach? I don't fully grasp copy-on-write and move semantics, but I think the second set of indexing expressions would be, in a sense, redundant?
If that is true, then my guess is that a solution here would be some sort of nullary or unary expression where I could define an expression that accepts the vector, some index k and k expressions/functions to apply to each element and spits out the vector I'm looking for. I've read the Eigen documentation on the subject, but I'm struggling to build a functional example. Any help would be appreciated!

So, if I understand you correctly, you don't want to replicate the vector (in terms of Eigen methods); you want to apply different methods to the same elements and store the result for each, correct?
In this case, computing it sequentially once per function is the easiest route. Most CPUs can only do one (vector) memory store per clock cycle, anyway. So for simple unary or binary operations, your gains have an upper bound.
Still, you are correct that one load is technically always better than two and it is a limitation of Eigen that there is no good way of achieving this.
Know that even if you manually write a loop that generates multiple outputs, you should limit the number of outputs. CPUs have a limited number of line-fill buffers. IIRC Intel recommended using fewer than 10 "output streams" in tight loops, otherwise you could stall the CPU on those.
Another aspect is that C++'s weak aliasing restrictions make it hard for compilers to vectorize code with multiple outputs. So it might even be detrimental.
How I would structure this code
Remember that Eigen is column-major, just like Matlab. Therefore use one column per output function. Or just use separate vectors to begin with.
Eigen::VectorXd v = ...;
Eigen::MatrixX2d out(v.size(), 2);
out.col(0) = v.array().floor();
out.col(1) = v.array().ceil();
Following the KISS principle, this is good enough. You will not gain much, if anything, by doing something more complicated. A bit of multithreading might gain you something (less than a factor of 2, I would guess) because a single CPU thread is not enough to max out memory bandwidth, but that's about it.
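For reference, a hedged sketch of that multithreaded variant, assuming OpenMP (-fopenmp), the v/out definitions above, and <algorithm> for std::min; each thread processes contiguous chunks of rows:
const std::ptrdiff_t rows = v.size(), chunk = 4096;
#pragma omp parallel for schedule(static)
for (std::ptrdiff_t j = 0; j < rows; j += chunk) {
    const std::ptrdiff_t n = std::min(chunk, rows - j);
    out.col(0).segment(j, n) = v.segment(j, n).array().floor();
    out.col(1).segment(j, n) = v.segment(j, n).array().ceil();
}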
Some benchmarking
This is my baseline:
int main()
{
    int rows = 100013, repetitions = 100000;
    Eigen::VectorXd v = Eigen::VectorXd::Random(rows);
    Eigen::MatrixX2d out(rows, 2);
    for(int i = 0; i < repetitions; ++i) {
        out.col(0) = v.array().floor();
        out.col(1) = v.array().ceil();
    }
}
Compiled with gcc-11, -O3 -mavx2 -fno-math-errno I get ca. 5.7 seconds.
Inspecting the assembler code shows good vectorization.
Plain old C++ version:
double* outfloor = out.data();
double* outceil = outfloor + out.outerStride();
const double* inarr = v.data();
for(std::ptrdiff_t j = 0; j < rows; ++j) {
    const double vj = inarr[j];
    outfloor[j] = std::floor(vj);
    outceil[j] = std::ceil(vj);
}
40 seconds instead of 5! This version actually does not vectorize because the compiler cannot prove that the arrays don't alias each other.
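One possible workaround, sketched under the assumption that your compiler supports the non-standard (but GCC/Clang/MSVC-provided) __restrict extension; promising non-overlap removes the aliasing obstacle:
#include <cmath>
#include <cstddef>

// floor_and_ceil is a hypothetical helper; __restrict tells the compiler
// the three arrays do not overlap, so the loop can auto-vectorize.
void floor_and_ceil(const double* __restrict inarr,
                    double* __restrict outfloor,
                    double* __restrict outceil,
                    std::ptrdiff_t rows)
{
    for (std::ptrdiff_t j = 0; j < rows; ++j) {
        const double vj = inarr[j];
        outfloor[j] = std::floor(vj);
        outceil[j] = std::ceil(vj);
    }
}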
Next, let's use fixed size Eigen vectors to get the compiler to generate vectorized code:
double* outfloor = out.data();
double* outceil = outfloor + out.outerStride();
const double* inarr = v.data();
std::ptrdiff_t j;
for(j = 0; j + 4 <= rows; j += 4) {
    const Eigen::Vector4d vj = Eigen::Vector4d::Map(inarr + j);
    const auto floorval = vj.array().floor();
    const auto ceilval = vj.array().ceil();
    Eigen::Vector4d::Map(outfloor + j) = floorval;
    Eigen::Vector4d::Map(outceil + j) = ceilval;
}
if(j + 2 <= rows) {
    const Eigen::Vector2d vj = Eigen::Vector2d::MapAligned(inarr + j);
    const auto floorval = vj.array().floor();
    const auto ceilval = vj.array().ceil();
    Eigen::Vector2d::Map(outfloor + j) = floorval;
    Eigen::Vector2d::Map(outceil + j) = ceilval;
    j += 2;
}
if(j < rows) {
    const double vj = inarr[j];
    outfloor[j] = std::floor(vj);
    outceil[j] = std::ceil(vj);
}
7.5 seconds. The assembler looks fine, fully vectorized. I'm not sure why performance is lower. Maybe cache line aliasing?
Last attempt: We don't try to avoid re-reading the vector but we re-read it blockwise so that it will be in cache by the time we read it a second time.
const int blocksize = 64 * 1024 / sizeof(double);
std::ptrdiff_t j;
for(j = 0; j + blocksize <= rows; j += blocksize) {
    const auto& vj = v.segment(j, blocksize);
    auto outj = out.middleRows(j, blocksize);
    outj.col(0) = vj.array().floor();
    outj.col(1) = vj.array().ceil();
}
const auto& vj = v.tail(rows - j);
auto outj = out.bottomRows(rows - j);
outj.col(0) = vj.array().floor();
outj.col(1) = vj.array().ceil();
5.4 seconds. So there is some gain here but not nearly enough to justify the added complexity.

Related

Correct datastructure for user specified integer mapping into rows of a matrix

I have a C++ class that manipulates an NxM matrix. The rows individually are meaningful, but the C++ contiguous indexing [0,1,2,...,N-1] is not. The users find it preferable to choose an indexing which has meaning to them, e.g., for a 3 row matrix, the user may wish to have the integer -3 label row zero, -1 label row 1, and 3 label row 2.
I may assume that 1) the labels are integers, 2) the labels are monotonically increasing, and 3) the number of rows is not huge. I may not assume the labels are contiguous, or even evenly spaced. The pseudocode is below:
template<typename T>
class Foo {
public:
    Foo(std::vector<int> labels, int columns) {
        m_.resize(labels.size()*columns);
    }
    void update(int label, T value) {
        // map label to index, update the entry in the matrix:
        int idx = ...;
        m_[idx] = value;
    }
    std::vector<T> get_row(int label) {
        // Map label
    }
private:
    // A matrix:
    std::vector<T> m_;
    // What datastructure should I use here?
    SomeDataStructure label_to_row_;
};
The call to update must be extremely fast. What is the best datastructure to use to quickly map the label to the row of the matrix?
Theoretically speaking, hash maps are the fastest containers for what you're trying to achieve (with O(1) complexity). But in practice, there are a couple of things you can do.
First of all, you can provide multiple implementations using different data structures and select one at runtime based on the given labels (using abstract classes or similar techniques). This applies to the structures I propose below.
If you know that the range of the data is small (or you can detect it at runtime), then the problem is easy. Just create a vector that spans the range of the data and store each label's row index in it:
std::vector<int> indices = {/*data*/};
auto minmax = std::minmax_element(indices.begin(), indices.end());
int min = *minmax.first, max = *minmax.second, range = max - min;
std::vector<int> index_map(range + 1); // +1 so the index (max - min) fits
for (size_t i = 0; i < indices.size(); ++i) index_map[indices[i] - min] = i;
In other words, the label itself (shifted by the minimum) becomes the position in index_map, so a lookup is a single array access.
If your range of data is large but the minimum spacing between them is also larger than 1, then you can do the previous method with a small modification:
std::vector<int> indices = {/*data*/};
auto minmax = std::minmax_element(indices.begin(), indices.end());
int min = *minmax.first, max = *minmax.second, range = max - min;
// Assuming indices are sorted in ascending order
int diff = std::numeric_limits<int>::max();
for (size_t i = 0; i < indices.size() - 1; ++i) diff = std::min(diff, indices[i + 1] - indices[i]);
// diff can't be zero since the labels are strictly increasing
std::vector<int> index_map(range / diff + 1);
for (size_t i = 0; i < indices.size(); ++i) index_map[(indices[i] - min) / diff] = i;
Here we find the minimum spacing between indices and divide by that.
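For context, a hedged sketch of how this lookup might sit inside the question's update() (min_, diff_, label_to_row_, and columns_ are hypothetical members cached in the constructor):
void update(int label, T value) {
    // O(1): shift by the cached minimum, divide by the cached minimum
    // spacing, then read the precomputed row index.
    int row = label_to_row_[(label - min_) / diff_];
    m_[static_cast<size_t>(row) * columns_] = value; // e.g. the row's first entry
}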
Alternatively, use a third-party hash map that is optimized beyond the standard library's (using vectorization, multi-threading, and other methods).
Maybe you can try to use a weaker but faster hash function, since the number of labels is not large.
I'll add to the list if I think of anything else.

Performance optimization nested loops

I am implementing a rather complicated code, and in one of the critical sections I basically need to consider all the possible strings of numbers following a certain rule. The naive implementation explaining what I do would be a nested loop like this:
std::array<int,3> max = { 3, 4, 6};
for(int i = 0; i <= max.at(0); ++i){
    for(int j = 0; j <= max.at(1); ++j){
        for(int k = 0; k <= max.at(2); ++k){
            DoSomething(i, j, k);
        }
    }
}
Obviously, I actually need more nested for loops and the "max" rule is more complicated, but the idea is clear, I think.
I implemented this idea using a recursive function approach:
std::array<int,3> max = { 3, 4, 6};
std::array<int,3> index = {0, 0, 0};
int total_depth = 3;
recursive_nested_for(0, index, max, total_depth);
where
void recursive_nested_for(int depth, std::array<int,3>& index,
                          std::array<int,3>& max, int total_depth)
{
    if(depth != total_depth){
        for(int i = 0; i <= max.at(depth); ++i){
            index.at(depth) = i;
            recursive_nested_for(depth+1, index, max, total_depth);
        }
    }
    else
        DoSomething(index);
}
In order to save as much as possible, I declare all the variables I use as globals in the actual code.
Since this part of the code takes really long, is it possible to do anything to speed it up?
I would also be open to writing 24 nested for loops if necessary, to at least avoid the overhead!
I thought that maybe an approach like expression templates to actually generate these nested for loops at compile time could be more elegant. But is it possible?
Any suggestion would be greatly appreciated.
Thanks to all.
The recursive_nested_for() is a nice idea. It's a bit inflexible as it is currently written. However, you could use std::vector<int> for the array dimensions and indices, or make it a template to handle any size std::array<>. The compiler might be able to inline all recursive calls if it knows how deep the recursion is, and then it will probably be just as efficient as the three nested for-loops.
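As an illustration, a sketch of the templated variant suggested above, assuming a DoSomething overload that accepts the whole index array; because the depth is a compile-time constant, the compiler can inline the entire recursion:
#include <array>
#include <cstddef>

template<std::size_t N>
void recursive_nested_for(std::size_t depth, std::array<int, N>& index,
                          const std::array<int, N>& max)
{
    if (depth == N) {
        DoSomething(index); // assumed to accept std::array<int, N>
        return;
    }
    // Inclusive upper bounds, matching the question's loops.
    for (int i = 0; i <= max[depth]; ++i) {
        index[depth] = i;
        recursive_nested_for(depth + 1, index, max);
    }
}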
Another option is to use a single for loop for incrementing the indices that need incrementing (note that max here holds exclusive bounds, i.e. the dimension sizes, unlike the inclusive bounds in the question):
void nested_for(std::array<int,3>& index, std::array<int,3>& max)
{
    while (index.at(2) < max.at(2)) {
        DoSomething(index);
        // Increment indices, carrying into the next dimension on overflow.
        // The last index is deliberately not wrapped back to zero;
        // otherwise the while condition could never become false.
        for (int i = 0; i < 3; ++i) {
            if (++index.at(i) < max.at(i))
                break;
            if (i < 2)
                index.at(i) = 0;
        }
    }
}
However, you can also consider creating a linear sequence that visits all possible combinations of the iterators i, j, k and so on. For example, with array dimensions {3, 4, 6}, there are 3 * 4 * 6 = 72 possible combinations. So you can have a single counter going from 0 to 72, and then "split" that counter into the three iterator values you need, like so:
for (int c = 0; c < 72; c++) {
    int k = c % 6;
    int j = (c / 6) % 4;
    int i = c / 6 / 4;
    DoSomething(i, j, k);
}
You can generalize this to as many dimensions as you want. Of course, the more dimensions you have, the higher the cost of splitting the linear iterator. But if your array dimensions are powers of two, it might be very cheap to do so. Also, it might be that you don't need to split it at all; for example, if you are calculating the sum of all elements of a multidimensional array, you don't care about the actual indices i, j, k and so on, you just want to visit all elements once. If the array is laid out linearly in memory, then you just need a linear iterator.
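For illustration, a hedged sketch of that generalized split for an arbitrary number of dimensions (dims holds exclusive sizes, as above, and DoSomething is assumed to take the index array):
#include <array>
#include <cstdint>

template<std::size_t N>
void linear_nested_for(const std::array<int, N>& dims)
{
    std::int64_t total = 1;
    for (int d : dims) total *= d; // product of all dimension sizes
    for (std::int64_t c = 0; c < total; ++c) {
        std::array<int, N> index;
        std::int64_t rest = c;
        // Peel off the fastest-varying (last) dimension first.
        for (std::size_t d = N; d-- > 0; ) {
            index[d] = static_cast<int>(rest % dims[d]);
            rest /= dims[d];
        }
        DoSomething(index); // assumed to accept std::array<int, N>
    }
}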
Of course, if you have 24 nested for loops, you'll notice that the product of all the dimension's sizes will become a very large number. If it doesn't fit in a 32 bit integer, your code is going to be very slow. If it doesn't fit into a 64 bit integer anymore, it will never finish.

Efficient layout and reduction of virtual 2d data (abstract)

I use C++ and CUDA/C and want to write code for a specific problem, and I ran into a quite tricky reduction problem.
My experience in parallel programming isn't negligible, but it is quite limited, and I cannot totally foresee the specifics of this problem.
I doubt there is a convenient or even "easy" way to handle the problems I am facing, but perhaps I am wrong.
If there are any resources (i.e. articles, books, web links, ...) or keywords covering this or similar problems, please let me know.
I tried to generalize the whole case as well as possible and keep it abstract instead of posting too much code.
The Layout ...
I have a system of N initial elements and N result elements. (I'll use N=8 as an example, but N can be any integral value greater than three.)
static size_t const N = 8;
double init_values[N], result[N];
I need to calculate almost every (not all, I'm afraid) unique pairwise combination of the init values, without self-interference.
This means calculating f(init_values[0],init_values[1]), f(init_values[0],init_values[2]), ..., f(init_values[0],init_values[N-1]), f(init_values[1],init_values[2]), ..., f(init_values[1],init_values[N-1]), and so on.
This is in fact a virtual triangular matrix which has the shape seen in the following illustration.
P 0 1 2 3 4 5 6 7
|---------------------------------------
0| x
|
1| 0 x
|
2| 1 2 x
|
3| 3 4 5 x
|
4| 6 7 8 9 x
|
5| 10 11 12 13 14 x
|
6| 15 16 17 18 19 20 x
|
7| 21 22 23 24 25 26 27 x
Each element is a function of the respective column and row elements in init_values.
P[i] (= P[row(i)][col(i)]) = f(init_values[col(i)], init_values[row(i)])
i.e.
P[11] (= P[5][1]) = f(init_values[1], init_values[5])
There are (N*N-N)/2 = 28 possible, unique combinations (Note: P[1][5]==P[5][1], so we only have a lower (or upper) triangular matrix) using the example N = 8.
The basic problem
The result array is computed from P as a sum of the row elements minus the sum of the respective column elements.
For example the result at position 3 will be calculated as a sum of row 3 minus the sum of column three.
result[3] = (P[3]+P[4]+P[5]) - (P[9]+P[13]+P[18]+P[24])
result[3] = sum_elements_row(3) - sum_elements_column(3)
I tried to illustrate it in a picture with N = 4.
As a consequence the following is true:
N-1 operations (potential concurrent writes) will be performed on each result[i]
result[i] will have N-(i+1) writes from subtractions and i additions
Each P[i][j] contributes a subtraction from result[j] and an addition to result[i]
This is where the main problems come into place:
Using one thread to compute each P and updating the result directly will result in multiple threads trying to write to the same result location (N-1 threads each).
Storing the whole matrix P for a subsequent reduction step on the other hand is very expensive in terms of memory consumption and therefore impossible for very large systems.
The idea of having a unique, shared result vector for each thread block is impossible, too.
(An N of 50k makes about 1.25 billion unique P elements and therefore [assuming a maximum of 1024 threads per block] a minimum of roughly 1.2 million blocks, consuming over 450 GiB of memory if each block has its own result array with 50k double elements.)
I think I could handle reduction for a more static behaviour but this problem is rather dynamic in terms of potential concurrent memory write-access.
(Or is it possible to handle it by some "basic" type of reduction?)
Adding some complications ...
Unfortunately, depending on (arbitrary user) input, which is independent of the initial values, some elements of P need to be skipped.
Let's assume we need to skip the combinations P[6], P[14] and P[18]. Therefore we have 25 combinations left that need to be calculated.
How to tell the kernel which values need to be skipped?
I came up with three approaches, each having notable downsides if N is very large (like several tens of thousands of elements).
1. Store all combinations ...
... with their respective row and column indices (struct combo { size_t row, col; };) that need to be calculated, store them in a vector<combo>, and operate on this vector (used by the current implementation):
std::vector<combo> elements;
// somehow fill
size_t const M = elements.size();
for (size_t i=0; i<M; ++i)
{
    // do the necessary computations using elements[i].row and elements[i].col
}
This solution consumes a lot of memory, since only "several" combinations are removed (it may even be tens of thousands of elements, but that's not much in contrast to several billion in total). However, it avoids
indexation computations
finding of removed elements
for each element of P, which is the downside of the second approach.
2. Operate on all elements of P and find removed elements
If I want to operate on each element of P and avoid nested loops (which i couldn't reproduce very well in cuda) I need to do something like this:
size_t M = (N*N-N)/2;
for (size_t i=0; i<M; ++i)
{
    // calculate row and column indices from `i`
    // (inverse of the triangular numbering above; mind floating-point
    // precision for very large i)
    double tmp = sqrt(8.0*double(i+1))/2.0 + 0.5;
    size_t current_row = size_t(floor(tmp));
    size_t current_col = i - current_row*(current_row-1)/2;
    // check whether the current combo of row and col is not to be removed
    if (!removes[current_row].exists(current_col))
    {
        // do the necessary computations using current_row and current_col
    }
}
The vector removes is very small in contrast to the elements vector in the first example but the additional computations to obtain current_row, current_col and the if-branch are very inefficient.
(Remember we're still talking about billions of evaluations.)
3. Operate on all elements of P and remove elements afterwards
Another idea I had was to calculate all valid and invalid combinations independently.
But unfortunately, due to summation errors the following statement is true:
calc_non_skipped() != calc_all() - calc_skipped()
Is there a convenient, known, high performance way to get the desired results from the initial values?
I know that this question is rather complicated and perhaps limited in relevance. Nevertheless, I hope some illuminative answers will help me to solve my problems.
The current implementation
Currently this is implemented as CPU code with OpenMP.
I first set up a vector of the above-mentioned combos, storing every P that needs to be computed, and pass it to a parallel for loop.
Each thread is provided with a private result vector and a critical section at the end of the parallel region is used for a proper summation.
First, I was puzzled for a moment about why (N**2 - N)/2 yielded 21 for N=7 ... but the indices run 0-7, so N=8, and there are 28 elements in P. Shouldn't try to answer questions like this so late in the day. :-)
But on to a potential solution: Do you need to keep the array P for any other purpose? If not, I think you can get the result you want with just two intermediate arrays, each of length N: one for the sum of the rows and one for the sum of the columns.
Here's a quick-and-dirty example of what I think you're trying to do (subroutine direct_approach()) and how to achieve the same result using the intermediate arrays (subroutine refined_approach()):
#include <cstdlib>
#include <cstdio>

const int N = 7;
const float input_values[N] = { 3.0F, 5.0F, 7.0F, 11.0F, 13.0F, 17.0F, 23.0F };
float P[N][N]; // Yes, I'm wasting half the array. This way I don't have to fuss with mapping the indices.
float result1[N] = { 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F };
float result2[N] = { 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F };

float f(float arg1, float arg2)
{
    // Arbitrary computation
    return (arg1 * arg2);
}

float compute_result(int index)
{
    float row_sum = 0.0F;
    float col_sum = 0.0F;
    int row;
    int col;

    // Compute the row sum
    for (col = (index + 1); col < N; col++)
    {
        row_sum += P[index][col];
    }

    // Compute the column sum
    for (row = 0; row < index; row++)
    {
        col_sum += P[row][index];
    }

    return (row_sum - col_sum);
}

void direct_approach()
{
    int row;
    int col;

    for (row = 0; row < N; row++)
    {
        for (col = (row + 1); col < N; col++)
        {
            P[row][col] = f(input_values[row], input_values[col]);
        }
    }

    int index;
    for (index = 0; index < N; index++)
    {
        result1[index] = compute_result(index);
    }
}

void refined_approach()
{
    float row_sums[N];
    float col_sums[N];
    int index;

    // Initialize intermediate arrays
    for (index = 0; index < N; index++)
    {
        row_sums[index] = 0.0F;
        col_sums[index] = 0.0F;
    }

    // Compute the row and column sums
    // This can be parallelized by computing row and column sums
    // independently, instead of in nested loops.
    int row;
    int col;
    for (row = 0; row < N; row++)
    {
        for (col = (row + 1); col < N; col++)
        {
            float computed = f(input_values[row], input_values[col]);
            row_sums[row] += computed;
            col_sums[col] += computed;
        }
    }

    // Compute the result
    for (index = 0; index < N; index++)
    {
        result2[index] = row_sums[index] - col_sums[index];
    }
}

void print_result(int n, float * result)
{
    int index;
    for (index = 0; index < n; index++)
    {
        printf(" [%d]=%f\n", index, result[index]);
    }
}

int main(int argc, char * * argv)
{
    printf("Data reduction test\n");
    direct_approach();
    printf("Result 1:\n");
    print_result(N, result1);
    refined_approach();
    printf("Result 2:\n");
    print_result(N, result2);
    return (0);
}
Parallelizing the computation is not so easy, since each intermediate value is a function of most of the inputs. You can compute the sums individually, but that would mean performing f(...) multiple times. The best suggestion I can think of for very large values of N is to use more intermediate arrays, computing subsets of the results, then summing the partial arrays to yield the final sums. I'd have to think about that one when I'm not so tired.
To cope with the skip issue: If it's a simple matter of "don't use input values x, y, and z", you can store x, y, and z in a do_not_use array and check for those values when looping to compute the sums. If the values to be skipped are some function of row and column, you can store those as pairs and check for the pairs.
Hope this gives you ideas for your solution!
Update, now that I'm awake: Dealing with "skip" depends a lot on what data needs to be skipped. Another possibility for the first case - "don't use input values x, y, and z" - a much faster solution for large data sets would be to add a level of indirection: create yet another array, this one of integer indices, and store only the indices of the good inputs. F'r instance, if invalid data is in inputs 2 and 5, the valid array would be:
int valid_indices[] = { 0, 1, 3, 4, 6 };
Iterate over the array valid_indices, and use those indices to retrieve the data from your input array to compute the result. On the other paw, if the values to skip depend on both indices of the P array, I don't see how you can avoid some kind of lookup.
Back to parallelizing - No matter what, you'll be dealing with (N**2 - N)/2 computations of f(). One possibility is to just accept that there will be contention for the sum arrays, which would not be a big issue if computing f() takes substantially longer than the two additions. When you get to very large numbers of parallel paths, contention will again be an issue, but there should be a "sweet spot" balancing the number of parallel paths against the time required to compute f().
If contention is still an issue, you can partition the problem several ways. One way is to compute a row or column at a time: for a row at a time, each column sum can be computed independently and a running total can be kept for each row sum.
Another approach would be to divide the data space and, thus, the computation into subsets, where each subset has its own row and column sum arrays. After each block is computed, the independent arrays can then be summed to produce the values you need to compute the result.
This probably will be one of those naive and useless answers, but it also might help. Feel free to tell me that I'm utterly and completely wrong and I have misunderstood the whole affair.
So... here we go!
The Basic Problem
It seems to me that you can define your result function a little differently, and it will lift at least some contention off your intermediate values. Let's suppose that your P matrix is lower-triangular. If you (virtually) fill the upper triangle with the negative of the lower values (and the main diagonal with all zeros), then you can redefine each element of your result as the sum of a single row: (shown here for N=4, and where -i means the negative of the value in the cell marked as i)
P 0 1 2 3
|--------------------
0| x -0 -1 -3
|
1| 0 x -2 -4
|
2| 1 2 x -5
|
3| 3 4 5 x
If you launch independent threads (executing the same kernel) to calculate the sum of each row of this matrix, each thread will write a single result element. It seems that your problem size is large enough to saturate your hardware threads and keep them busy.
The caveat, of course, is that you'll be calculating each f(x, y) twice. I don't know how expensive that is, or how much the memory contention was costing you before, so I cannot judge whether this is a worthwhile trade-off to do or not. But unless f was really really expensive, I think it might be.
Skipping Values
You mention that you might have tens of thousands of elements of the P matrix that you need to ignore in your calculations (effectively skip them).
To work with the scheme I've proposed above, I believe you should store the skipped elements as (row, col) pairs, and you have to add the transposed of each coordinate pair too (so you'll have twice the number of skipped values.) So your example skip list of P[6], P[14] and P[18] becomes P(4,0), P(5,4), P(6,3) which then becomes P(4,0), P(5,4), P(6,3), P(0,4), P(4,5), P(3,6).
Then you sort this list, first based on row and then column. This makes our list to be P(0,4), P(3,6), P(4,0), P(4,5), P(5,4), P(6,3).
If each row of your virtual P matrix is processed by one thread (or a single instance of your kernel or whatever,) you can pass it the values it needs to skip. Personally, I would store all these in a big 1D array and just pass in the first and last index that each thread would need to look at (I would also not store the row indices in the final array that I passed in, since it can be implicitly inferred, but I think that's obvious.) In the example above, for N = 8, the begin and end pairs passed to each thread will be: (note that the end is one past the final value needed to be processed, just like STL, so an empty list is denoted by begin == end)
Thread 0: 0..1
Thread 1: 1..1 (or 0..0 or whatever)
Thread 2: 1..1
Thread 3: 1..2
Thread 4: 2..4
Thread 5: 4..5
Thread 6: 5..6
Thread 7: 6..6
Now, each thread goes on to calculate and sum all the intermediate values in a row. While it is stepping through the indices of columns, it is also stepping through this list of skipped values and skipping any column number that comes up in the list. This is obviously an efficient and simple operation (since the list is sorted by column too. It's like merging.)
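To make the preprocessing concrete, here is a hedged host-side sketch (plain C++, not CUDA) that mirrors each skipped (row, col) pair, sorts the list, and derives the per-row begin/end offsets described above; build_skip_ranges and Skip are hypothetical names:
#include <algorithm>
#include <cstddef>
#include <vector>

struct Skip { unsigned row, col; };

// Returns per-row [begin, end) offsets into the flattened column list.
// Thread r then uses cols_out[begins[r] .. begins[r+1]).
std::vector<unsigned> build_skip_ranges(std::vector<Skip> skips, unsigned N,
                                        std::vector<unsigned>& cols_out)
{
    const std::size_t n = skips.size();
    for (std::size_t i = 0; i < n; ++i) // add the transposed pairs
        skips.push_back({skips[i].col, skips[i].row});
    std::sort(skips.begin(), skips.end(), [](const Skip& a, const Skip& b) {
        return a.row < b.row || (a.row == b.row && a.col < b.col);
    });
    std::vector<unsigned> begins(N + 1, 0);
    for (const Skip& s : skips) ++begins[s.row + 1]; // count entries per row
    for (unsigned r = 0; r < N; ++r) begins[r + 1] += begins[r]; // prefix sum
    cols_out.clear();
    for (const Skip& s : skips) cols_out.push_back(s.col); // row is implicit
    return begins;
}
For the example skip list above, cols_out comes out as [4, 6, 0, 5, 4, 3] and begins reproduces the thread ranges 0..1, 1..1, 1..1, 1..2, 2..4, 4..5, 5..6, 6..6.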
Pseudo-Implementation
I don't know CUDA, but I have some experience working with OpenCL, and I imagine the interfaces are similar (since the hardware they are targeting are the same.) Here's an implementation of the kernel that does the processing for a row (i.e. calculates one entry of result) in pseudo-C++:
double calc_one_result (
    unsigned my_id, unsigned N, double const init_values [],
    unsigned skip_indices [], unsigned skip_begin, unsigned skip_end
)
{
    double res = 0;

    for (unsigned col = 0; col < my_id; ++col)
        // "f" seems to take init_values[column] as its first arg
        res += f (init_values[col], init_values[my_id]);

    for (unsigned row = my_id + 1; row < N; ++row)
        res -= f (init_values[my_id], init_values[row]);

    // At this point, "res" is holding "result[my_id]",
    // including the values that should have been skipped

    unsigned i = skip_begin;

    // The second condition is to check whether we have reached the
    // middle of the virtual matrix or not
    for (; i < skip_end && skip_indices[i] < my_id; ++i)
    {
        unsigned col = skip_indices[i];
        res -= f (init_values[col], init_values[my_id]);
    }

    for (; i < skip_end; ++i)
    {
        unsigned row = skip_indices[i];
        res += f (init_values[my_id], init_values[row]);
    }

    return res;
}
Note the following:
The semantics of init_values and function f are as described by the question.
This function calculates one entry in the result array; specifically, it calculates result[my_id], so you should launch N instances of this.
The only shared variable it writes to is result[my_id]. Well, the above function doesn't write to anything, but if you translate it to CUDA, I imagine you'd have to write to that at the end. However, no one else writes to that particular element of result, so this write will not cause any contention of data race.
The two input arrays, init_values and skipped_indices are shared among all the running instances of this function.
All accesses to data are linear and sequential, except for the skipped values, which I believe is unavoidable.
skipped_indices contains a list of indices that should be skipped in each row. Its contents and structure are as described above, with one small optimization. Since there was no need, I have removed the row numbers and left only the columns. The row number will be passed into the function as my_id anyway, and the slice of the skipped_indices array that should be used by each invocation is determined using skip_begin and skip_end.
For the example above, the array that is passed into all invocations of calc_one_result will look like this: [4, 6, 0, 5, 4, 3].
As you can see, apart from the loops, the only conditional branch in this code is skip_indices[i] < my_id in the third for-loop. Although I believe this is innocuous and totally predictable, even this branch can be easily avoided in the code. We just need to pass in another parameter called skip_middle that tells us where the skipped items cross the main diagonal (i.e. for row #my_id, the index at skipped_indices[skip_middle] is the first that is larger than my_id.)
In Conclusion
I'm by no means an expert in CUDA and HPC. But if I have understood your problem correctly, I think this method might eliminate any and all contentions for memory. Also, I don't think this will cause any (more) numerical stability issues.
The cost of implementing this is:
Calling f twice as many times in total (and keeping track of when it is called for row < col so you can multiply the result by -1.)
Storing twice as many items in the list of skipped values. Since the size of this list is in the thousands (and not billions!) it shouldn't be much of a problem.
Sorting the list of skipped values; which again due to its size, should be no problem.
(UPDATE: Added the Pseudo-Implementation section.)

1D Convolution without if-else statements (non-FFT)?

I've written a simple serial 1D convolution function (below). I'm also experimenting with GPU convolution implementations. This is mostly for my own curiosity; I'm trying to learn the performance tradeoffs among various non-FFT implementation strategies.
Avoiding branching will be important for my GPU convolution experiments, since branching is expensive on Nvidia GPUs. One of my friends mentioned that there's a way to implement the code below without if/else statements, but he couldn't remember how it works.
How can I do a correct 1D convolution implementation without using any if/else statements?
Here's my basic 1D serial code in C++:
vector<int> myConv1d(vector<int> vec, vector<int> kernel)
{
    int paddedLength = (int)vec.size() + (int)kernel.size() - 1;
    vector<int> convolved(paddedLength); // zeros
    reverse(kernel.begin(), kernel.end()); // flip the kernel (if we don't flip it, then we have correlation instead of convolution)
    for(int outputIdx = 0; outputIdx < paddedLength; outputIdx++) // index into 'convolved' vector
    {
        int vecIdx = outputIdx - (int)kernel.size() + 1; // aligns with leftmost element of kernel
        for(int kernelIdx = 0; kernelIdx < (int)kernel.size(); kernelIdx++)
        {
            if( (vecIdx+kernelIdx) >= 0 && (vecIdx+kernelIdx) < (int)vec.size() ) // TODO: FIND A WAY TO REMOVE THIS
            {
                convolved[outputIdx] += kernel[kernelIdx]*vec[vecIdx+kernelIdx];
            }
        }
    }
    return convolved;
}
A couple of quick notes:
I did find some related posts, but I didn't quite understand the strategy for avoiding conditional statements.
I've also written a 2D convolution implementation, and I'm hoping to apply the results of this SO post to the 2D version as well.
This is NOT homework. It's marginally related to one of our research projects, but it's mostly for the sake of learning.
Why don't you do something like this?
int lowerBound = std::max( 0, -vecIdx );
int upperBound = std::min( (int)kernel.size(), (int)vec.size() - vecIdx );
for( int kernelIdx = lowerBound; kernelIdx < upperBound; kernelIdx++ )
Sorry if I didn't understand the question.
Either zero-extend or border-extend the source vector to avoid the checks. If the source vector V is of size L and the kernel of size K, pad it by prepending and appending K-1 elements.
With L = 5 and K = 3, you end up with the padded vector
p p v v v v v q q
where the vs are the vector elements and the ps and qs the padding. Keep in mind that GPU toolkits should allow clamping reads outside a source vector to either 0 or the border value, effectively making the above padding unnecessary.
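For reference, a hedged sketch of the zero-padded variant of the question's function; with the input zero-extended by K-1 on each side, every access is in range and the inner loop needs no bounds check:
#include <algorithm>
#include <vector>

std::vector<int> myConv1dPadded(const std::vector<int>& vec, std::vector<int> kernel)
{
    const int K = (int)kernel.size();
    const int paddedLength = (int)vec.size() + K - 1;
    std::reverse(kernel.begin(), kernel.end()); // flip for convolution
    // Zero-extend the input by K-1 on each side so every access is in range.
    std::vector<int> padded(vec.size() + 2 * (K - 1), 0);
    std::copy(vec.begin(), vec.end(), padded.begin() + (K - 1));
    std::vector<int> convolved(paddedLength, 0);
    for (int outputIdx = 0; outputIdx < paddedLength; ++outputIdx)
        for (int kernelIdx = 0; kernelIdx < K; ++kernelIdx) // branch-free body
            convolved[outputIdx] += kernel[kernelIdx] * padded[outputIdx + kernelIdx];
    return convolved;
}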

Long array performance issue

I have an array of char pointers of length 175,000. Each pointer points to a c-string of length 100, where each character is either 1 or 0. I need to compute the difference between the strings.
char* arr[175000];
So far, I have two for loops where I compare every string with every other string. The comparison functions basically take two c-strings and returns an integer which is the number of differences of the arrays.
This is taking really long on my 4-core machine. Last time I left it to run for 45 min and it never finished executing. Please advise on a faster solution or some optimizations.
Example:
000010
000001
have a difference of 2 since the last two bits do not match.
After I calculate the difference, I store the value in another array:
int holder;
for(int x = 0; x < UsedTableSpace; x++){
    int min = 10000000;
    for(int y = 0; y < UsedTableSpace; y++){
        if(x != y){
            // compr calculates difference between two c-string arrays
            int tempDiff = compr(similarity[x]->matrix, similarity[y]->matrix);
            if(tempDiff < min){
                min = tempDiff;
                holder = y;
            }
        }
    }
    similarity[holder]->inbound++;
}
With more information, we could probably give you better advice, but based on what I understand of the question, here are some ideas:
Since you're using each character to represent a 1 or a 0, you're using several times more memory than you need, which hurts performance when it comes to caching and such. Instead, represent your data using numeric values that you can think of in terms of a series of bits.
Once you've implemented #1, you can grab an entire integer or long at a time and do a bitwise XOR operation to end up with a number that has a 1 in every place where the two numbers didn't have the same values. Then you can use bit-counting tricks to count these bits speedily (see the sketch after this list).
Work on "unrolling" your loops somewhat to avoid the number of jumps necessary. For example, the following code:
total = total + array[i];
total = total + array[i + 1];
total = total + array[i + 2];
... will work faster than just looping over total = total + array[i] three times. Jumps are expensive, and interfere with the processor's pipelining. Update: I should mention that your compiler may be doing some of this for you already--you can check the compiled code to see.
Break your overall data set into chunks that will allow you to take full advantage of caching. Think of your problem as a "square" with the i index on one axis and the j index on the other. If you start with one i and iterate across all 175,000 j values, the first j values you visit will be gone from the cache by the time you get to the end of the line. On the other hand, if you take the top left corner and go from j=0 to 256, most of the values on the j axis will still be in a low-level cache as you loop around to compare them with i=0, 1, 2, etc.
Lastly, although this should go without saying, I guess it's worth mentioning: Make sure your compiler is set to optimize!
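Putting ideas #1 and #2 together, a hedged sketch using two 64-bit words per row and a portable popcount via std::bitset (PackedRow, pack, and hammingDistance are hypothetical names):
#include <bitset>
#include <cstdint>

// Pack each 100-character '0'/'1' string into two 64-bit words.
struct PackedRow { std::uint64_t bits[2]; };

PackedRow pack(const char* s) // s must have exactly 100 characters
{
    PackedRow p{{0, 0}};
    for (int i = 0; i < 100; ++i)
        if (s[i] == '1')
            p.bits[i / 64] |= std::uint64_t(1) << (i % 64);
    return p;
}

int hammingDistance(const PackedRow& a, const PackedRow& b)
{
    // XOR leaves a 1 wherever the rows differ; count() counts those 1s.
    return int(std::bitset<64>(a.bits[0] ^ b.bits[0]).count()
             + std::bitset<64>(a.bits[1] ^ b.bits[1]).count());
}
Each comparison then drops from 100 byte loads to two XORs and two popcounts.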
One simple optimization is to compare each pair of strings only once. If the difference between A and B is 12, the difference between B and A is also 12. Your running time is going to drop by almost half.
In code:
int compr(const char* a, const char* b) {
    int d = 0, i;
    for (i = 0; i < 100; ++i)
        if (a[i] != b[i]) ++d;
    return d;
}

void main_function(...) {
    for(int x = 0; x < UsedTableSpace; x++){
        int min = 10000000, holder = 0;
        for(int y = x + 1; y < UsedTableSpace; y++){
            // compr calculates difference between two c-string arrays
            int tempDiff = compr(similarity[x]->matrix, similarity[y]->matrix);
            if(tempDiff < min){
                min = tempDiff;
                holder = y;
            }
        }
        similarity[holder]->inbound++;
    }
}
Notice the inner for loop: I've changed the start index.
Another optimization is running the comparisons on separate threads to take advantage of your 4 cores.
What is your goal, i.e. what do you want to do with the Hamming Distances (which is what they are) after you've got them? For example, if you are looking for the closest pair, or most distant pair, you probably can get an O(n ln n) algorithm instead of the O(n^2) methods suggested so far. (At n=175000, n^2 is 15000 times larger than n ln n.)
For example, you could characterize each 100-bit number m by 8 4-bit numbers, being the number of bits set in 8 segments of m, and sort the resulting 32-bit signatures into ascending order. Signatures of the closest pair are likely to be nearby in the sorted list. It is easy to lower-bound the distance between two numbers if their signatures differ, giving an effective branch-and-bound process as less-distant numbers are found.
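As a hedged sketch of that signature idea, reusing the PackedRow layout from the earlier sketch (eight segments of up to 13 bits each, so every 4-bit count fits):
#include <cstdint>

// Build a 32-bit signature: 4 bits per segment holding that segment's popcount.
// The per-segment difference |count_a - count_b| summed over all segments is a
// lower bound on the Hamming distance, which enables branch-and-bound pruning.
std::uint32_t signature(const PackedRow& p)
{
    std::uint32_t sig = 0;
    for (int seg = 0; seg < 8; ++seg) {
        int count = 0;
        const int lo = seg * 13, hi = (seg == 7) ? 100 : lo + 13;
        for (int i = lo; i < hi; ++i)
            count += int((p.bits[i / 64] >> (i % 64)) & 1u);
        sig = (sig << 4) | std::uint32_t(count);
    }
    return sig;
}
Sorting the 175,000 rows by signature() then tends to place near-duplicates next to each other, so a closest-pair search can scan a sorted window instead of all pairs.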