Windows 7, NVIDIA GeForce 425M.
I wrote a simple CUDA code which calculates the row sums of a matrix.
The matrix is stored in a one-dimensional representation (a pointer to float).
The serial version of code is below (it has 2 loops, as expected):
void serial_rowSum (float* m, float* output, int nrow, int ncol) {
    float sum;
    for (int i = 0 ; i < nrow ; i++) {
        sum = 0;
        for (int j = 0 ; j < ncol ; j++)
            sum += m[i*ncol + j];
        output[i] = sum;
    }
}
Inside the CUDA code, I call the kernel function sweeping the matrix by rows. Below is the kernel call snippet:
dim3 threadsPerBlock((unsigned int) nThreadsPerBlock); // has to be multiple of 32
dim3 blocksPerGrid((unsigned int) ceil(nrow/(float) nThreadsPerBlock));
kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);
and the kernel function which performs the parallel sum of the rows (still has 1 loop):
__global__ void kernel_rowSum(float *m, float *s, int nrow, int ncol) {
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
    if (rowIdx < nrow) {
        float sum = 0;
        for (int k = 0 ; k < ncol ; k++)
            sum += m[rowIdx*ncol + k];
        s[rowIdx] = sum;
    }
}
So far so good. The serial and parallel (CUDA) results are equal.
The whole point is that the CUDA version takes almost twice as long as the serial one to compute, no matter how I change the nThreadsPerBlock parameter: I tested nThreadsPerBlock from 32 to 1024 (the maximum number of threads per block allowed for my card).
IMO, the matrix dimension is big enough to justify parallelization: 90,000 x 1,000.
Below, I report the time elapsed for the serial and parallel versions using different nThreadsPerBlock. Time reported in msec over an average of 100 samples:
Matrix: nrow = 90000 x ncol = 1000
Serial: Average Time Elapsed per Sample in msec (100 samples): 289.18.
CUDA (32 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 497.11.
CUDA (1024 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 699.66.
For the record, the version with 32 nThreadsPerBlock is the fastest and the one with 1024 the slowest.
I understand that there is a kind of overhead when copying from Host to Device and the other way around, but maybe the slowness is because I am not implementing the fastest code.
Since I am far from being a CUDA expert:
Am I coding the fastest version for this task? How could I improve my code?
Can I get rid of the loop in the kernel function?
Any thoughts appreciated.
EDIT 1
Although I describe a standard rowSum, I am interested in the AND/OR operation of rows which hold {0,1} values, like rowAND/rowOR. That said, this doesn't allow me to exploit the cuBLAS multiply-by-a-column-vector-of-1's trick, as suggested by some commenters.
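To make the target concrete, here is a sketch of the rowAND variant I have in mind (same structure as kernel_rowSum above; only the accumulator's starting value and the combine step change, and the values are assumed to be {0,1} stored as float):
__global__ void kernel_rowAND(const float *m, float *s, int nrow, int ncol) {
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
    if (rowIdx < nrow) {
        float acc = 1.f;                           // identity for AND
        for (int k = 0 ; k < ncol ; k++)
            acc = fminf(acc, m[rowIdx*ncol + k]);  // AND on {0,1}; rowOR: start from 0.f and use fmaxf
        s[rowIdx] = acc;
    }
}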
EDIT 2
As suggested by other users and endorsed here:
FORGET ABOUT TRYING TO WRITE YOUR OWN FUNCTIONS, use the Thrust library instead and the magic comes.
Since you mentioned you need a general reduction algorithm, not just sum, I will try to give 3 approaches here. The kernel approach may have the highest performance. The Thrust approach is easiest to implement. The cuBLAS approach works only with sum but gives good performance.
Kernel Approach
Here's a very good doc introducing how to optimize a standard parallel reduction. Standard reduction can be divided into 2 stages:
Multiple thread blocks each reduces one part of the data;
One thread block reduces from result of stage 1 to the final 1 element.
For your multi-reduction (reducing the rows of mat) problem, only stage 1 is enough. The idea is to reduce 1 row per thread block; a sketch follows. For further considerations like multiple rows per thread block or 1 row per multiple thread blocks, you can refer to the paper provided by @Novak. This may improve the performance further, especially for badly shaped matrices.
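As an illustration only (my sketch, not code from the linked doc; BLOCK_SIZE is an assumed launch parameter and must equal blockDim.x), a minimal stage-1 kernel reducing one row per thread block might look like this. For the rowAND/rowOR of the question, only the identity value and the combine operation would change:
#define BLOCK_SIZE 256  // assumed; must be a power of two and equal blockDim.x

// One thread block per row: each thread strides over the row with a
// coalesced access pattern, then a shared-memory tree combines the partials.
__global__ void row_reduce(const float *m, float *out, int ncol) {
    __shared__ float sdata[BLOCK_SIZE];
    const float *row = m + (size_t)blockIdx.x * ncol;

    float acc = 0.f;                               // identity of the reduction
    for (int j = threadIdx.x; j < ncol; j += blockDim.x)
        acc += row[j];                             // adjacent threads read adjacent elements
    sdata[threadIdx.x] = acc;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) { // tree reduction in shared memory
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = sdata[0];
}

// launch: one block per row
// row_reduce<<<nrow, BLOCK_SIZE>>>(d_m, d_output, ncol);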
Thrust Approach
General multi-reduction can be done with thrust::reduce_by_key in a few minutes. You can find some discussion here: Determining the least element and its position in each matrix column with CUDA Thrust.
However thrust::reduce_by_key does not assume that each row has the same length, so you will pay a performance penalty. Another post, How to normalize matrix columns in CUDA with max performance?, gives a profiling comparison between thrust::reduce_by_key and the cuBLAS approach on the sum of rows. It may give you a basic understanding of the performance; a minimal sketch follows.
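A minimal sketch of the reduce_by_key formulation (my own condensation of the posts linked above; the row_index functor and row_sums wrapper names are mine):
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>

// Maps a linear element index to its row index in a row-major matrix.
struct row_index {
    int Ncols;
    __host__ __device__ row_index(int ncols) : Ncols(ncols) {}
    __host__ __device__ int operator()(int i) const { return i / Ncols; }
};

// Sums the rows of a row-major Nrows x Ncols matrix stored in d_matrix.
void row_sums(const thrust::device_vector<float>& d_matrix,
              thrust::device_vector<float>& d_row_sums,
              int Nrows, int Ncols)
{
    // Keys are generated on the fly: element i belongs to row i / Ncols,
    // so equal keys are contiguous, as reduce_by_key requires.
    thrust::reduce_by_key(
        thrust::make_transform_iterator(thrust::make_counting_iterator(0), row_index(Ncols)),
        thrust::make_transform_iterator(thrust::make_counting_iterator(0), row_index(Ncols)) + Nrows * Ncols,
        d_matrix.begin(),
        thrust::make_discard_iterator(),  // the row keys themselves are not needed
        d_row_sums.begin());
}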
cuBLAS Approach
The sum of the rows/cols of a matrix A can be seen as a matrix-vector multiplication in which the elements of the vector are all ones. It can be represented by the following MATLAB code:
y = A * ones(size(A,2),1);
where y is the sum of rows of A.
The cuBLAS library provides a high-performance matrix-vector multiplication function, cublas<t>gemv(), for this operation.
Timing results show that this routine is only 10~50% slower than simply reading all the elements of A once, which can be seen as the theoretical upper limit of the performance for this operation. A sketch of the call follows.
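A minimal sketch of that call (my wiring, not library example code, under the assumption of a row-major Nrows x Ncols matrix; since cuBLAS assumes column-major storage, the row-major A is handed over as its transpose via CUBLAS_OP_T):
#include <cublas_v2.h>
#include <thrust/device_vector.h>

// y = A * ones: the i-th entry of y receives the sum of row i of A.
void row_sums_gemv(cublasHandle_t handle, const float *d_A, float *d_y,
                   int Nrows, int Ncols)
{
    thrust::device_vector<float> d_ones(Ncols, 1.f);
    const float alpha = 1.f, beta = 0.f;
    // Viewed column-major, d_A is the Ncols x Nrows transpose of A,
    // so CUBLAS_OP_T makes gemv compute A * ones.
    cublasSgemv(handle, CUBLAS_OP_T, Ncols, Nrows, &alpha, d_A, Ncols,
                thrust::raw_pointer_cast(d_ones.data()), 1,
                &beta, d_y, 1);
}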
Reducing the rows of a matrix can be done with CUDA Thrust in three ways (they may not be the only ones, but addressing that point is out of scope). As the OP also recognized, using CUDA Thrust is preferable for this kind of problem. An approach using cuBLAS is also possible.
APPROACH #1 - reduce_by_key
This is the approach suggested at this Thrust example page. It includes a variant using make_discard_iterator.
APPROACH #2 - transform
This is the approach suggested by Robert Crovella at CUDA Thrust: reduce_by_key on only some values in an array, based off values in a “key” array.
APPROACH #3 - inclusive_scan_by_key
This is the approach suggested by Eric at How to normalize matrix columns in CUDA with max performance?.
APPROACH #4 - cublas<t>gemv
It uses cuBLAS gemv to multiply the relevant matrix by a column of 1's.
THE FULL CODE
Here is the code condensing the four approaches above. The Utilities.cu and Utilities.cuh files are maintained here and omitted. The TimingGPU.cu and TimingGPU.cuh files are maintained here and omitted as well.
#include <cublas_v2.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <thrust/sequence.h>
#include <stdio.h>
#include <iostream>
#include "Utilities.cuh"
#include "TimingGPU.cuh"
// --- Required for approach #2
__device__ float *vals;
/**************************************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX - NEEDED FOR APPROACH #1 */
/**************************************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {
    T Ncols; // --- Number of columns
    __host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}
    __host__ __device__ T operator()(T i) { return i / Ncols; }
};
/******************************************/
/* ROW_REDUCTION - NEEDED FOR APPROACH #2 */
/******************************************/
struct row_reduction {
    const int Ncols; // --- Number of columns
    row_reduction(int _Ncols) : Ncols(_Ncols) {}
    __device__ float operator()(float& x, int& y) {
        float temp = 0.f;
        for (int i = 0; i < Ncols; i++)
            temp += vals[i + (y * Ncols)];
        return temp;
    }
};
/**************************/
/* NEEDED FOR APPROACH #3 */
/**************************/
template<typename T>
struct MulC : public thrust::unary_function<T, T>
{
    T C;
    __host__ __device__ MulC(T c) : C(c) { }
    __host__ __device__ T operator()(T x) { return x * C; }
};
/********/
/* MAIN */
/********/
int main()
{
    const int Nrows = 5; // --- Number of rows
    const int Ncols = 8; // --- Number of columns

    // --- Random uniform integer distribution between 10 and 99
    thrust::default_random_engine rng;
    thrust::uniform_int_distribution<int> dist(10, 99);

    // --- Matrix allocation and initialization
    thrust::device_vector<float> d_matrix(Nrows * Ncols);
    for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist(rng);

    TimingGPU timerGPU;

    /***************/
    /* APPROACH #1 */
    /***************/
    timerGPU.StartCounter();
    // --- Allocate space for row sums and indices
    thrust::device_vector<float> d_row_sums(Nrows);
    thrust::device_vector<int>   d_row_indices(Nrows);

    // --- Compute row sums by summing values with equal row indices
    //thrust::reduce_by_key(thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)),
    //                      thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
    //                      d_matrix.begin(),
    //                      d_row_indices.begin(),
    //                      d_row_sums.begin(),
    //                      thrust::equal_to<int>(),
    //                      thrust::plus<float>());

    thrust::reduce_by_key(
        thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
        thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
        d_matrix.begin(),
        thrust::make_discard_iterator(),
        d_row_sums.begin());

    printf("Timing for approach #1 = %f\n", timerGPU.GetCounter());

    // --- Print result
    for (int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for (int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums[i] << "\n";
    }

    /***************/
    /* APPROACH #2 */
    /***************/
    timerGPU.StartCounter();
    thrust::device_vector<float> d_row_sums_2(Nrows, 0);
    float *s_vals = thrust::raw_pointer_cast(&d_matrix[0]);
    gpuErrchk(cudaMemcpyToSymbol(vals, &s_vals, sizeof(float *)));
    thrust::transform(d_row_sums_2.begin(), d_row_sums_2.end(), thrust::counting_iterator<int>(0), d_row_sums_2.begin(), row_reduction(Ncols));

    printf("Timing for approach #2 = %f\n", timerGPU.GetCounter());

    for (int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for (int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums_2[i] << "\n";
    }

    /***************/
    /* APPROACH #3 */
    /***************/
    timerGPU.StartCounter();
    thrust::device_vector<float> d_row_sums_3(Nrows, 0);
    thrust::device_vector<float> d_temp(Nrows * Ncols);
    thrust::inclusive_scan_by_key(
        thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
        thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
        d_matrix.begin(),
        d_temp.begin());
    thrust::copy(
        thrust::make_permutation_iterator(
            d_temp.begin() + Ncols - 1,
            thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))),
        thrust::make_permutation_iterator(
            d_temp.begin() + Ncols - 1,
            thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))) + Nrows,
        d_row_sums_3.begin());

    printf("Timing for approach #3 = %f\n", timerGPU.GetCounter());

    for (int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for (int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums_3[i] << "\n";
    }

    /***************/
    /* APPROACH #4 */
    /***************/
    cublasHandle_t handle;
    timerGPU.StartCounter();
    cublasSafeCall(cublasCreate(&handle));
    thrust::device_vector<float> d_row_sums_4(Nrows);
    thrust::device_vector<float> d_ones(Ncols, 1.f);
    float alpha = 1.f;
    float beta  = 0.f;
    cublasSafeCall(cublasSgemv(handle, CUBLAS_OP_T, Ncols, Nrows, &alpha, thrust::raw_pointer_cast(d_matrix.data()), Ncols,
                               thrust::raw_pointer_cast(d_ones.data()), 1, &beta, thrust::raw_pointer_cast(d_row_sums_4.data()), 1));

    printf("Timing for approach #4 = %f\n", timerGPU.GetCounter());

    for (int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for (int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums_4[i] << "\n";
    }

    return 0;
}
TIMING RESULTS (tested on a Kepler K20c)
 Matrix size  |   #1   | #1-v2  |   #2   |   #3   |   #4   | #4 (no handle)
--------------|--------|--------|--------|--------|--------|---------------
  100 x  100  |  0.63  |  1.00  |  0.10  |  0.18  | 139.4  |  0.098
 1000 x 1000  |  1.25  |  1.12  |  3.25  |  1.04  | 101.3  |  0.12
 5000 x 5000  |  8.38  | 15.3   | 16.05  | 13.8   | 111.3  |  1.14
  100 x 5000  |  1.25  |  1.52  |  2.92  |  1.75  | 101.2  |  0.40
 5000 x  100  |  1.35  |  1.99  |  0.37  |  1.74  | 139.2  |  0.14
It seems that approaches #1 and #3 outperform approach #2, except for small numbers of columns. The best approach, however, is approach #4, which is significantly more convenient than the others, provided that the time needed to create the cuBLAS handle can be amortized during the computation.
If this is the extent (summing the rows) of the operations you need to do with this data, I wouldn't expect a sizable benefit from the GPU. You have exactly one arithmetic operation per data element, and for that you are paying the cost of transferring that data element to the GPU. And beyond a certain problem size (whatever it takes to keep the machine busy) you get no added benefit from larger problem sizes, because the amount of arithmetic grows only linearly with the data: the arithmetic intensity stays constant at one operation per element.
So this isn't a particularly exciting problem to solve on the GPU.
But as talonmies has indicated, you have a coalescing problem in the way you have crafted it, which will further slow things down. Let's take a look at a small example:
C1 C2 C3 C4
R1 11 12 13 14
R2 21 22 23 24
R3 31 32 33 34
R4 41 42 43 44
Above is a simple pictorial example of a small portion of your matrix. The machine data storage is such that elements (11), (12), (13), and (14) are stored in adjacent memory locations.
For coalesced access, we want an access pattern such that adjacent memory locations are requested from the same instruction, executed across the warp.
We need to think about the execution of your code from the standpoint of a warp, that is, 32 threads executing in lockstep. What is your code doing? Which elements is it retrieving (asking for) at each step/instruction? Let's take a look at this line of code:
sum+=m[rowIdx*ncol+k];
Adjacent threads in the warp have adjacent (i.e. consecutive) values for rowIdx, the way you have created that variable. So when k = 0, which data element is being asked for by each thread when we try to retrieve the value m[rowIdx*ncol+k]?
In block 0, thread 0 has a rowIdx of 0. Thread 1 has a rowIdx of 1, etc. So the values being asked for by each thread at this instruction are:
Thread: Memory Location: Matrix Element:
0 m[0] (11)
1 m[ncol] (21)
2 m[2*ncol] (31)
3 m[3*ncol] (41)
But this is not coalesced access! Elements (11), (21), etc. are not adjacent in memory. For coalesced access, we would like that Matrix Element row to read like this:
Thread: Memory Location: Matrix Element:
0 m[?] (11)
1 m[?] (12)
2 m[?] (13)
3 m[?] (14)
If you then work backwards to determine what the value of ? should be, you will come up with an instruction something like this:
sum+=m[k*ncol+rowIdx];
This will give coalesced access, but it will not give you the correct answer, because we are now summing matrix columns instead of matrix rows. We can fix this by re-organizing your data storage to be in column-major order rather than row-major order. (You should be able to google that for ideas, right?) Conceptually, this is equivalent to transposing your matrix m. Whether this is convenient for you to do or not is outside the scope of your question, as I see it, and not really a CUDA issue. It may be a simple thing for you to do as you are creating the matrix on the host or transferring the matrix from host to device. But in summary, I don't know of a way to sum the matrix rows with 100% coalesced access, if the matrix is stored in row-major order. (You could resort to a sequence of row-reductions but that looks painful to me.)
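As a concrete sketch of that reorganization (my code, not part of the question): if m is stored column-major, element (row, k) lives at m[k*nrow + row], so the stride between consecutive k becomes nrow, and the same thread-per-row kernel becomes coalesced:
// m is stored column-major: element (row r, col k) is at m[k*nrow + r].
// At each k, adjacent threads (adjacent rowIdx) read adjacent memory: coalesced.
__global__ void kernel_rowSum_colmajor(const float *m, float *s, int nrow, int ncol) {
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
    if (rowIdx < nrow) {
        float sum = 0.f;
        for (int k = 0; k < ncol; k++)
            sum += m[k * nrow + rowIdx];
        s[rowIdx] = sum;
    }
}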
It's not uncommon, when we are thinking about ways to accelerate code on the GPU, to consider re-organizing our data storage to facilitate the GPU. This is one example.
And, yes, what I'm outlining here still retains a loop in the kernel.
As an additional comment, I would suggest timing the data copy portions, and kernel (compute) portions separately. I can't tell from your question whether you are timing just the kernel or the entire (GPU) operation, including the data copies. If you time the data copies separately, you may discover that just the data copy time exceeds your CPU time. Any effort put into optimizing your CUDA code will not affect the data copy time. This might be a useful data point before you spend much time on this.
Related
I have a large tensor of floating-point data with dimensions 35k (rows) x 45 (cols) x 150 (slices), which I have stored in an armadillo cube container. I need to linearly combine all 150 slices together in under 35 ms (a must for my application). The linear-combination floating-point weights are also stored in an armadillo container. My fastest implementation so far takes 70 ms, averaged over a window of 30 frames, and I don't seem to be able to beat that. Please note I'm allowed CPU parallel computations but not GPU.
I have tried multiple different ways of performing this linear combination but the following code seems to be the fastest I can get (70 ms) as I believe I'm maximizing the cache hit chances by fetching the largest possible contiguous memory chunk at each iteration.
Please note that Armadillo stores data in column major format. So in a tensor, it first stores the columns of the first channel, then the columns of the second channel, then third and so forth.
typedef std::chrono::system_clock Timer;
typedef std::chrono::duration<double> Duration;

int rows = 35000;
int cols = 45;
int slices = 150;
arma::fcube tensor(rows, cols, slices, arma::fill::randu);
arma::fvec w(slices, arma::fill::randu);

double overallTime = 0;
int window = 30;
for (int n = 0; n < window; n++) {
    Timer::time_point start = Timer::now();

    arma::fmat result(rows, cols, arma::fill::zeros);
    for (int i = 0; i < slices; i++)
        result += tensor.slice(i) * w(i);

    Timer::time_point end = Timer::now();
    Duration span = end - start;
    double t = span.count();
    overallTime += t;
    cout << "n = " << n << " --> t = " << t * 1000.0 << " ms" << endl;
}
cout << endl << "average time = " << overallTime * 1000.0 / window << " ms" << endl;
I need to optimize this code by at least 2x and I would very much appreciate any suggestions.
First of all I have to admit I'm not familiar with the arma framework or its memory layout, much less whether the syntax result += slice(i) * weight evaluates lazily.
The two primary problems, and their solutions, in any case lie in the memory layout and the memory-to-arithmetic computation ratio.
Saying a += b*c is problematic because it needs to read b and a, write a, and uses up to two arithmetic operations (two, if the architecture does not combine multiplication and accumulation).
If the memory layout is of the form float tensor[rows][columns][channels], the problem is converted to making rows * columns dot products of length channels, and it should be expressed as such.
If it's float tensor[c][h][w], it's better to unroll the loop to result += slice(i)*w(i) + slice(i+1)*w(i+1) + .... Reading four slices at a time reduces the memory transfers by 50%.
It might even be better to process the results in chunks of 4*N outputs (reading from all of the 150 channels/slices), where N < 16, so that the accumulators can be allocated explicitly or implicitly by the compiler to SIMD registers.
There's a possibility of a minor improvement by padding the slice count to multiples of 4 or 8, by compiling with -ffast-math to enable fused multiply accumulate (if available) and with multithreading.
The constraints indicate the need to perform 13.5 GFLOPS (35000 * 45 * 150 * 2 FLOPs every 35 ms), which is a reasonable number in terms of arithmetic for many modern architectures, but it also means at least 54 GB/s of memory bandwidth (two 4-byte transfers per tensor element every 35 ms), which could be relaxed with fp16 or 16-bit fixed-point arithmetic.
EDIT
Knowing the memory order to be float tensor[150][45][35000], i.e. float tensor[kSlices][kRows * kCols == kCols * kRows], suggests to me to first try unrolling the outer loop by 4 (or maybe even 5, as 150 is not divisible by 4 and would otherwise require a special case for the excess) streams.
void blend(int kCols, int kRows, float const *tensor, float *result, float const *w) {
    // ensure that cols*rows is a multiple of 4 (pad if necessary)
    // - allows the auto-vectorizer to skip handling the 'excess' code where the data
    //   length mod simd width != 0
    // one could try even a SIMD width of 16*4, as clang 14
    // can further unroll the inner loop to 4 ymm registers
    auto const stride = (kCols * kRows + 3) & ~3;

    // try also s+=6, s+=3, or s+=4, which would require a dedicated inner loop (for s+=2)
    for (int s = 0; s < 150; s += 5) {
        auto src0 = tensor + s * stride;
        auto src1 = src0 + stride;
        auto src2 = src1 + stride;
        auto src3 = src2 + stride;
        auto src4 = src3 + stride;
        auto dst = result;
        for (int x = 0; x < stride; x++) {
            // clang should be able to optimize caching the weights
            // to registers outside the inner loop
            auto add = src0[x] * w[s] +
                       src1[x] * w[s + 1] +
                       src2[x] * w[s + 2] +
                       src3[x] * w[s + 3] +
                       src4[x] * w[s + 4];
            // clang should be able to optimize this comparison
            // out of the loop, generating two inner kernels
            if (s == 0) {
                dst[x] = add;
            } else {
                dst[x] += add;
            }
        }
    }
}
EDIT 2
Another starting point (before adding multithreading) would be to consider changing the layout to
float tensor[kCols][kRows][kSlices + kPadding]; // padding is optional
The downside now is that kSlices = 150 can no longer fit all the weights in registers (and, secondly, kSlices is not a multiple of 4 or 8). Furthermore, the final reduction needs to be horizontal.
The upside is that the reduction no longer needs to go through memory, which is a big thing with the added multithreading.
void blendHWC(float const *tensor, float const *w, float *dst, int n, int c) {
    // each thread will read from 4 positions in order
    // to share the weights -- finding the best distance
    // might need some iterations
    for (int i = 0; i < n / 4; i++) {
        auto src0 = tensor + 4 * i * c;   // advance to the next group of 4 positions
        auto src1 = src0 + c;
        auto src2 = src1 + c;
        auto src3 = src2 + c;
        vec8 acc0(0.0f), acc1(0.0f), acc2(0.0f), acc3(0.0f);
        // #pragma unroll?
        for (int j = 0; j < c; j += 8) {
            vec8 wv(w + j);               // weights shared by all four positions
            acc0 += wv * vec8(src0 + j);
            acc1 += wv * vec8(src1 + j);
            acc2 += wv * vec8(src2 + j);
            acc3 += wv * vec8(src3 + j);
        }
        vec4 sum = horizontal_reduct(acc0, acc1, acc2, acc3);
        sum.store(dst); dst += 4;
    }
}
These vec4 and vec8 are some custom SIMD classes, which map to SIMD instructions either through intrinsics, or by virtue of the compiler being able to compile using vec4 = float __attribute__((vector_size(16))); to efficient SIMD code; a sketch of such a class follows.
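For instance, a vec8 along those lines might be sketched as follows (my illustration of the idea using GCC/clang vector extensions; the author's actual classes are not shown):
#include <cstring>

using v8sf = float __attribute__((vector_size(32)));  // 8 packed floats (one AVX ymm register)

struct vec8 {
    v8sf v;
    vec8(v8sf x) : v(x) {}
    explicit vec8(float x) { for (int i = 0; i < 8; ++i) v[i] = x; }  // broadcast
    explicit vec8(float const *p) { std::memcpy(&v, p, sizeof v); }   // unaligned load
    vec8& operator+=(vec8 o) { v += o.v; return *this; }
    friend vec8 operator*(vec8 a, vec8 b) { return vec8(a.v * b.v); } // element-wise
    void store(float *p) const { std::memcpy(p, &v, sizeof v); }
};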
As @hbrerkere suggested in the comment section, by using the -O3 flag and making the following changes, the performance improved by almost 65%. The code now runs at 45 ms as opposed to the initial 70 ms.
int lastStep = (slices / 4 - 1) * 4;
int i = 0;
while (i <= lastStep) {
    result += tensor.slice(i) * w_id(i) + tensor.slice(i + 1) * w_id(i + 1)
            + tensor.slice(i + 2) * w_id(i + 2) + tensor.slice(i + 3) * w_id(i + 3);
    i += 4;
}
while (i < slices) {
    result += tensor.slice(i) * w_id(i);
    i++;
}
Without having the actual code, I'm guessing that
+= tensor.slice(i) * w_id(i)
creates a temporary object and then adds it to the lhs. Yes, overloaded operators look nice, but I would write a function
addto( lhs, slice1, w1, slice2, w2, ....unroll to 4... )
which translates to pure loops over the elements:
for (i = ...)
    for (j = ...)
        lhs[i][j] += slice1[i][j]*w1 + slice2[i][j]*w2 + ...;
It would surprise me if that doesn't buy you an extra factor.
According to the docs, Eigen supports multi-threaded dense v. dense, or row-major sparse v. dense, but not sparse v. sparse matrix multiplications. In my application I have 2 sparse matrices and I want to multiply them in parallel, i.e. obtain C = A * B where C and B are column major.
[Edit:] Specific to my application I also know the sparsity pattern of all matrices in advance, and I need to perform such matrix multiplication many times (each time updating the non-zero values of C using the updated A and B). This leads me to think that there should be a thread-safe way to perform this operation in this case since the memory allocation of C is fixed [end edit]
Given that Eigen gives fast read/write on column blocks for column-major sparse matrices, I can perform the operation by subsetting B into column blocks (suppose A, B and C have all been initialized and are of the correct size):
int nchunks = 7;
int chunksize = B.cols() / nchunks;
#pragma omp parallel for
for (int i = 0; i < nchunks; i++) {
    if (i == nchunks - 1) {
        C.rightCols(B.cols() - (nchunks - 1) * chunksize) = A * B.rightCols(B.cols() - (nchunks - 1) * chunksize);
    } else {
        C.middleCols(i * chunksize, chunksize) = A * B.middleCols(i * chunksize, chunksize);
    }
}
In my application, I have to perform A.transpose() * S * A thousands of times where S is symmetric positive definite, and I can split this operation into 2 steps each done in parallel. The result must be symmetric. By doing so I get a significant speedup.
To my non-expert eyes, there should be no issue because all threads share C but each operates on a different set of columns. In fact I expect my application to be extremely sensitive to even tiny issues because it needs to maintain symmetry after running 2 such sparse matrix multiplications.
However, this code works on Linux (Ubuntu 18.04, with R 3.6.1 linked to Intel MKL), but not on Windows (I've tried with both vanilla R 3.6.1 and MRO 3.5.3). My application fails in some cases on Windows due to a failed Cholesky, in turn caused by C not being created correctly, which I've traced to the above chunk of code. Simply removing OMP solves the issue, of course at the cost of performance.
I have not been able to reproduce this consistently on Windows: in isolation, the product works correctly. But it is definitely an OMP issue because removing the parallel clause solves the issue.
So what's going on?
Here's a working piece of code
//[[Rcpp::export]]
Rcpp::List spmat_mult(const Eigen::SparseMatrix<double, Eigen::RowMajor>& A,
                      const Eigen::SparseMatrix<double>& B){
    int nchunks = 7;
    int chunksize = B.cols() / nchunks;

    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    Eigen::SparseMatrix<double> Cs(A.rows(), B.cols());
    Cs = A * B;
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    clog << "Standard "
         << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
         << "us.\n";

    Eigen::SparseMatrix<double> C(A.rows(), B.cols());
    C = Cs;

    start = std::chrono::steady_clock::now();
#pragma omp parallel for
    for (int i = 0; i < nchunks; i++) {
        if (i == nchunks - 1) {
            C.rightCols(B.cols() - (nchunks - 1) * chunksize) = A * B.rightCols(B.cols() - (nchunks - 1) * chunksize);
        } else {
            C.middleCols(i * chunksize, chunksize) = A * B.middleCols(i * chunksize, chunksize);
        }
    }
    end = std::chrono::steady_clock::now();
    clog << "OMP "
         << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
         << "us.\n";

    return Rcpp::List::create(
        Rcpp::Named("Cs") = Cs,
        Rcpp::Named("C") = C
    );
}
and on R
library(Matrix)
set.seed(1)
n <- 5000 # choose size
A <- B <- matrix(0, ncol=n, nrow=n)
nnz <- 250
A_rows <- sample(1:n, nnz, replace=F)
A_cols <- sample(1:n, nnz, replace=F)
B_rows <- sample(1:n, nnz, replace=F)
B_cols <- sample(1:n, nnz, replace=F)
A[A_rows, A_cols] <- apply(A[A_rows, A_cols], 1:2, function(x) runif(1))
B[B_rows, B_cols] <- apply(B[B_rows, B_cols], 1:2, function(x) runif(1))
A <- as(A, "dgRMatrix")
B <- as(B, "dgCMatrix")
C <- spmat_mult(A, B)
I have a performance problem: CPU and GPU performances are almost the same.
The problem I am dealing with is patch match. I have 2 matrices. I want to find where the maximum similarity between the big matrix and the small one is.
The matrices have binary values 0/1 (black and white).
When I check a match between a small matrix and a big one with an i5 CPU, it takes 30 ms (using multithreading).
When I check a match between a small matrix and a big one on a GeForce GT 730, it takes 33 ms.
I would expect the GPU to be faster by at least an order of magnitude. I am pretty disappointed with my current results.
I have two matrices:
1) Big: 300,000 elements (300 rows, 1000 columns)
2) Small: 50,000 elements (50 rows, 1000 columns)
The comparison is done by dividing the big matrix into 250 sub-matrices and comparing each one to the small matrix, then finding the highest similarity.
The similarity criterion is the sum of corresponding black pixels in both matrices (the small and the sub-big) divided by the sum of black pixels in the sub-big.
I did the last task using the following CUDA code:
__global__ void matCompare_cuda(uint8_t *D_SUB, uint8_t *D_SMALL, float *D_RSLTS, unsigned int step, int numOfIndentations, int SUB_size, int SMALL_size)
{
    int i = 0, j = 0, success = 0, sumZero = 0;
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int LoopIndex = (tid * step);

    if (tid < numOfIndentations)
    {
        for (j = 0; j < SMALL_size; j++)
        {
            i = j + LoopIndex;
            if (D_SUB[i] == 0)
            {
                sumZero++;
                if (D_SMALL[j] == 0)
                    success++;
            }
        }
        if (success > 0 && sumZero > 500)
            D_RSLTS[tid] = 100 * ((float)success / sumZero);
    }
}
The kernel launch:
int numOfIndentations = 300 - 50;  // [(big.rows) - (small.rows)]
int numBlock = 16;
int threadNumber = numOfIndentations / numBlock;
matCompare_cuda<<<numBlock, threadNumber>>>(D_SUB, D_SMALL, D_RSLTS, step, numOfIndentations, SUB_size, SMALL_size);
The CPU code:
for (i = 0; i < pixelNum; i++)
{
    if (SUB[i] == 0)
    {
        sumDots = sumDots + 1;
        if (SMALL->Image[i] == 0)
        {
            success = success + 1;
        }
    }
}
if (success > 0)
    if (sumDots > 500)
        RSLT = ((float)success / sumDots) * 100;
Do you see any improvement that can be done in the GPU code?
A few things.
Try to avoid the ifs if possible. Here you can write:
sumZero += (1 - D_SUB[i]);
success += (1 - D_SUB[i]) * (1 - D_SMALL[j]);
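Dropped into the kernel from the question, the branch-free inner loop would look like this (a sketch reusing the kernel's own variables; it assumes D_SUB and D_SMALL hold only 0 or 1):
// rewritten inner loop of matCompare_cuda
for (j = 0; j < SMALL_size; j++)
{
    i = j + LoopIndex;
    sumZero += (1 - D_SUB[i]);                    // counts pixels that are 0 in the sub-matrix
    success += (1 - D_SUB[i]) * (1 - D_SMALL[j]); // counts pixels that are 0 in both
}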
However, I don't think you're going to see a huge difference here, for two reasons.
One is that there's overhead in invoking CUDA. The data needs to be copied to the graphics card and back, and that eats some of the speedup you get. I'm not sure how much it is, but since the run-time is so short it could play a role. I hope you didn't time the compilation of the kernel and other one-time things (take them out by running the code in a loop and ignoring the first few iterations).
Second, your big matrix is too small and your small matrix is too big. Because the small matrix is so big (1000 columns), I'm guessing it plays really well with the CPU cache lines. If the small matrix were smaller, you would have to go to the next line more often, which would increase the chance of breaking the cache line; the GPU uses rectangles for caching, so this wouldn't be a problem there. If the big matrix were bigger, you would also increase the amount of computation required, so the GPU would start to get ahead.
Given a n-by-m matrix, I would like to build a n-sized vector containing the minimums of each matrix row, in CUDA.
So far I've come through this:
__global__ void OnMin(float * Mins, const float * Matrix, const int n, const int m) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        Mins[i] = Matrix[m * i];
        for (int j = 1; j < m; ++j) {
            if (Matrix[m * i + j] < Mins[i])
                Mins[i] = Matrix[m * i + j];
        }
    }
}
called in:
OnMin<<<(n + TPB - 1) / TPB, TPB>>>(Mins, Matrix, n, m);
However I think that something more optimized could exist.
I tried invoking cublasIsamin in a loop, but it is slower.
I also tried launching a kernel (global) from the OnMin kernel, without success... (sm_35, compute_35 raises compile errors... I have a GTX 670)
Any ideas ?
Thanks!
Finding the min of array rows in a row-major matrix is a parallel reduction question that has been discussed many times on Stack Overflow. For example, this one:
Reduce matrix rows with CUDA
The basic idea is to use n blocks in a grid. Each block contains a fixed number of threads, typically 256. Each block of threads will do the parallel reduction on a row of the m elements to find the minimum collaboratively.
For a large enough matrix where the GPU can be fully utilized, the performance upper bound is half the time of copying the matrix once.
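A minimal sketch of that scheme (my code; TPB = 256 is an assumption, and blockDim.x must be a power of two for the tree step):
#include <cfloat>  // FLT_MAX

#define TPB 256  // threads per block (assumed; must equal blockDim.x)

// One block per row: each thread scans a strided slice of the row,
// then a shared-memory tree combines the per-thread minima.
__global__ void row_min(const float *Matrix, float *Mins, int m) {
    __shared__ float smem[TPB];
    const float *row = Matrix + (size_t)blockIdx.x * m;

    float v = FLT_MAX;                      // identity for min
    for (int j = threadIdx.x; j < m; j += blockDim.x)
        v = fminf(v, row[j]);
    smem[threadIdx.x] = v;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smem[threadIdx.x] = fminf(smem[threadIdx.x], smem[threadIdx.x + s]);
        __syncthreads();
    }
    if (threadIdx.x == 0)
        Mins[blockIdx.x] = smem[0];
}

// launch: one block per row
// row_min<<<n, TPB>>>(Matrix, Mins, m);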
I found this function that calculates the determinant of a boost::ublas matrix:
template<typename ValType>
ValType det_fast(const ublas::matrix<ValType>& matrix)
{
    // create a working copy of the input
    ublas::matrix<ValType> mLu(matrix);
    ublas::permutation_matrix<std::size_t> pivots(matrix.size1());

    auto isSingular = ublas::lu_factorize(mLu, pivots);
    if (isSingular)
        return static_cast<ValType>(0);

    ValType det = static_cast<ValType>(1);
    for (std::size_t i = 0; i < pivots.size(); ++i)
    {
        if (pivots(i) != i)
            det *= static_cast<ValType>(-1);
        det *= mLu(i, i);
    }
    return det;
}
This function works fine, but only with non-integer types (it works fine with float and double). When I try to pass the same matrix but with an element type of int, I get the following error:
Check failed in file c:\boost\boost_1_58_0\boost\numeric\ublas\lu.hpp at line 167:
singular != 0 || detail::expression_type_check (prod (triangular_adaptor (m), triangular_adaptor (m)), cm)
unknown location(0): fatal error in "BaseTest": std::logic_error: internal logic
Is this a bug in Boost, or is my function wrong?
What change could I make to avoid this error?
Of course this is not a bug, since checks like this one are all over uBLAS. I guess this is because most of its operations would make no sense for non-float types.
You can disable type checks using
#define BOOST_UBLAS_TYPE_CHECK 0
before the includes. But think twice! Take a look at this example:
#include <iostream>

#define BOOST_UBLAS_TYPE_CHECK 0
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/lu.hpp>

namespace ublas = boost::numeric::ublas;

template<typename ValType>
ValType det_fast(const ublas::matrix<ValType>& matrix)
{
    // create a working copy of the input
    ublas::matrix<ValType> mLu(matrix);
    ublas::permutation_matrix<std::size_t> pivots(matrix.size1());

    auto isSingular = ublas::lu_factorize(mLu, pivots);
    if (isSingular)
        return static_cast<ValType>(0);

    ValType det = static_cast<ValType>(1);
    for (std::size_t i = 0; i < pivots.size(); ++i)
    {
        if (pivots(i) != i)
            det *= static_cast<ValType>(-1);
        det *= mLu(i, i);
    }
    return det;
}

int main()
{
    ublas::matrix<int> m(3, 3);
    for (unsigned i = 0; i < m.size1(); ++i)
        for (unsigned j = 0; j < m.size2(); ++j)
            m(i, j) = 3 * i + j;
    std::cout << det_fast(m) << std::endl;
}
It is obvious that matrix m is singular, and if you change the type from int to double the function returns zero. But with int it returns -48.
EDIT #1
Also, you can create a ublas::matrix<int> without any errors and assign it to a ublas::matrix<float>. So one way to correctly calculate the determinant is to overload det_fast and remove the #define statement:
int det_fast(const ublas::matrix<int>& matrix)
{
    return (int)det_fast(ublas::matrix<float>(matrix));
}
EDIT #2
Take a look at the table below, where the average time is the time of the full determinant calculation (for int, including the copy operation), averaged over 5 program runs.
 size | average time, ms
      |   int      float
------+-------------------
  100 |    9.0       9.0
  200 |   46.6      46.8
  300 |  156.4     159.4
  400 |  337.4     331.4
  500 |  590.8     609.2
  600 |  928.2    1009.4
  700 | 1493.8    1525.2
  800 | 2162.8    2231.0
  900 | 3184.2    3025.2
You might think that int is faster, but it's not true. On average, over more runs, I'm sure you would get a slight edge for float (I'd guess about 0-15 ms, which is within the timing error). Also, if we measure the copy time, we see that for matrix sizes below 3000 x 3000 it is near zero (again about 0-15 ms, within the timing error). For larger matrices the copy time increases (for example, for a 5000 x 5000 matrix it is 125 ms). But there is one important note! The copy time is less than 1.0% of the whole determinant calculation for an int matrix, and its share decreases greatly as the size grows!
All of these measurements were made for a program compiled with Visual Studio 2013 in the release configuration, with Boost version 1.58. Times were measured with the clock function. The CPU is an Intel Xeon E5645 @ 2.4 GHz.
Also, performance mostly depends on how your methods will be used. If you're going to call this function many times for small matrices, then the copy time may become significant.