How to improve cwiseProduct operations? - c++

I have this function that is called several times in my code:
void Grid::computeFVarsSigma(const int DFAType,
                             const Matrix& D_sigma,
                             const Matrix& Phi,
                             const Matrix& DPhiDx,
                             const Matrix& DPhiDy,
                             const Matrix& DPhiDz,
                             Matrix& Rho,
                             Matrix& DRhoDx,
                             Matrix& DRhoDy,
                             Matrix& DRhoDz)
{
    // auto PhiD = Phi * D_sigma;
    Rho = ((Phi * D_sigma).cwiseProduct(Phi)).rowwise().sum();

    if (DFAType == 1)
    {
        DRhoDx = 2. * ((Phi * D_sigma).cwiseProduct(DPhiDx)).rowwise().sum();
        DRhoDy = 2. * ((Phi * D_sigma).cwiseProduct(DPhiDy)).rowwise().sum();
        DRhoDz = 2. * ((Phi * D_sigma).cwiseProduct(DPhiDz)).rowwise().sum();
    }
}
For the use case I took to benchmark, the input arrays have the following dimensions:
D_sigma   42 x 42
Phi       402264 x 42
DPhiDx    402264 x 42
DPhiDy    402264 x 42
DPhiDz    402264 x 42
The average time when this function is called 12 times is 0.621 seconds, measured with std::chrono::high_resolution_clock. I'm running these calculations on an AMD Ryzen 5, compiled with g++ 7.5.0. I could bump the compiler version, but for now I'm most interested in code optimizations.
One idea that I'd like to explore is to store the cwiseProduct computations of DRhoDx, DRhoDy and DRhoDz directly in a 3xNGridPoints Matrix. However, I don't know how to do it yet.
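Roughly, I imagine something like this untested sketch (stacking the derivatives as the rows of one matrix), but I'm not sure it's the right approach:
// untested sketch: compute Phi * D_sigma once and fill a 3 x NGridPoints matrix
Matrix PhiD = Phi * D_sigma;
Matrix DRho(3, Phi.rows());   // rows 0/1/2 = d/dx, d/dy, d/dz
DRho.row(0) = 2. * PhiD.cwiseProduct(DPhiDx).rowwise().sum().transpose();
DRho.row(1) = 2. * PhiD.cwiseProduct(DPhiDy).rowwise().sum().transpose();
DRho.row(2) = 2. * PhiD.cwiseProduct(DPhiDz).rowwise().sum().transpose();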
Are there any other manipulations that I could try to improve this function?
Thanks in advance for your comments.
I would like to thank @chtz and @Homer512 for their very nice suggestions. I was very happy with the one-liner optimization proposed by @chtz; however, @Homer512's suggestions made a drastic difference in performance, as shown in the figure below (special thanks to @Homer512!). I will certainly use both suggestions as a starting point to improve other parts of my code.
Note, I'm using double, and in the figure below "return param" and "return tuple" stand for the same function returning the output as parameters and as a tuple, respectively.

I'll do this optimization in steps. First we establish a baseline.
You didn't give a type definition for your Matrix type, so I define it as Eigen::MatrixXf. Also, just for my own sanity, I redeclare the various Rho outputs as vectors. Note that Eigen occasionally has optimized code paths for vectors compared to matrices that just happen to be vectors, so doing this is a good idea anyway, plus it makes the code easier to read.
using Matrix = Eigen::MatrixXf;
using Vector = Eigen::VectorXf;

namespace {

void compute(const Matrix& Phi, const Matrix& D_sigma, const Matrix& DPhi,
             float factor, Vector& Rho)
{
    Rho = (Phi * D_sigma).cwiseProduct(DPhi).rowwise().sum() * factor;
}

} /* namespace anonymous */

void computeFVarsSigma(const int DFAType, const Matrix& D_sigma,
                       const Matrix& Phi, const Matrix& DPhiDx, const Matrix& DPhiDy,
                       const Matrix& DPhiDz, Vector& Rho, Vector& DRhoDx, Vector& DRhoDy,
                       Vector& DRhoDz)
{
    compute(Phi, D_sigma, Phi, 1.f, Rho);
    if (DFAType == 1) {
        compute(Phi, D_sigma, DPhiDx, 2.f, DRhoDx);
        compute(Phi, D_sigma, DPhiDy, 2.f, DRhoDy);
        compute(Phi, D_sigma, DPhiDz, 2.f, DRhoDz);
    }
}
The first optimization, as proposed by @chtz, is to cache the matrix multiplication. Don't use auto for this, as noted in Eigen's documentation.
namespace {

void compute(const Matrix& PhiD, const Matrix& DPhi, float factor, Vector& Rho)
{
    Rho = PhiD.cwiseProduct(DPhi).rowwise().sum() * factor;
}

} /* namespace anonymous */

void computeFVarsSigma(const int DFAType, const Matrix& D_sigma,
                       const Matrix& Phi, const Matrix& DPhiDx, const Matrix& DPhiDy,
                       const Matrix& DPhiDz, Vector& Rho, Vector& DRhoDx, Vector& DRhoDy,
                       Vector& DRhoDz)
{
    const Matrix PhiD = Phi * D_sigma;
    compute(PhiD, Phi, 1.f, Rho);
    if (DFAType == 1) {
        compute(PhiD, DPhiDx, 2.f, DRhoDx);
        compute(PhiD, DPhiDy, 2.f, DRhoDy);
        compute(PhiD, DPhiDz, 2.f, DRhoDz);
    }
}
This is now 3.15 times as fast on my system.
The second step is to reduce the amount of memory required by doing the operation blockwise. The idea is pretty simple: We are somewhat constrained by memory bandwidth, especially since the matrix-matrix product is rather "thin". Plus it helps with the step after this.
Here I pick a block size of 384 rows. My rule of thumb is that the inputs and outputs should fit into the L2 cache (128-256 kiB, possibly shared by 2 threads) and that it should be a multiple of 16 for good vectorization across the board. 384 rows * 42 columns * 4 byte per float = 64 kiB. Adjust as required for other scalar types but from my tests it is actually not very sensitive.
Take care to use Eigen::Ref or appropriate templates to avoid copies, as I did here in the compute helper function.
namespace {

void compute(const Matrix& PhiD, const Eigen::Ref<const Matrix>& DPhi,
             float factor, Eigen::Ref<Vector> Rho)
{
    Rho = PhiD.cwiseProduct(DPhi).rowwise().sum() * factor;
}

} /* namespace anonymous */

void computeFVarsSigma(const int DFAType, const Matrix& D_sigma,
                       const Matrix& Phi, const Matrix& DPhiDx, const Matrix& DPhiDy,
                       const Matrix& DPhiDz, Vector& Rho, Vector& DRhoDx, Vector& DRhoDy,
                       Vector& DRhoDz)
{
    const Eigen::Index n = Phi.rows(), blocksize = 384;
    Rho.resize(n);
    if(DFAType == 1)
        for(Vector* vec: {&DRhoDx, &DRhoDy, &DRhoDz})
            vec->resize(n);

    Matrix PhiD;
    for(Eigen::Index i = 0; i < n; i += blocksize) {
        const Eigen::Index cur = std::min(blocksize, n - i);
        PhiD.noalias() = Phi.middleRows(i, cur) * D_sigma;
        compute(PhiD, Phi.middleRows(i, cur), 1.f, Rho.segment(i, cur));
        if (DFAType == 1) {
            compute(PhiD, DPhiDx.middleRows(i, cur), 2.f,
                    DRhoDx.segment(i, cur));
            compute(PhiD, DPhiDy.middleRows(i, cur), 2.f,
                    DRhoDy.segment(i, cur));
            compute(PhiD, DPhiDz.middleRows(i, cur), 2.f,
                    DRhoDz.segment(i, cur));
        }
    }
}
This is another speedup by a factor of 1.75.
Now that we have this, we can parallelize very easily. Eigen can parallelize the matrix-matrix multiplication internally, but not the rest, so we do it all externally. The blockwise version works better because it can keep all threads busy all the time and it makes better use of the combined L2 cache capacity of the system. Compile with -fopenmp.
namespace {

void compute(const Matrix& PhiD, const Eigen::Ref<const Matrix>& DPhi,
             float factor, Eigen::Ref<Vector> Rho)
{
    Rho = PhiD.cwiseProduct(DPhi).rowwise().sum() * factor;
}

} /* namespace anonymous */

void computeFVarsSigma(const int DFAType, const Matrix& D_sigma,
                       const Matrix& Phi, const Matrix& DPhiDx, const Matrix& DPhiDy,
                       const Matrix& DPhiDz, Vector& Rho, Vector& DRhoDx, Vector& DRhoDy,
                       Vector& DRhoDz)
{
    const Eigen::Index n = Phi.rows(), blocksize = 384;
    Rho.resize(n);
    if(DFAType == 1)
        for(Vector* vec: {&DRhoDx, &DRhoDy, &DRhoDz})
            vec->resize(n);

#   pragma omp parallel
    {
        Matrix PhiD;
#       pragma omp for nowait
        for(Eigen::Index i = 0; i < n; i += blocksize) {
            const Eigen::Index cur = std::min(blocksize, n - i);
            PhiD.noalias() = Phi.middleRows(i, cur) * D_sigma;
            compute(PhiD, Phi.middleRows(i, cur), 1.f, Rho.segment(i, cur));
            if (DFAType == 1) {
                compute(PhiD, DPhiDx.middleRows(i, cur), 2.f,
                        DRhoDx.segment(i, cur));
                compute(PhiD, DPhiDy.middleRows(i, cur), 2.f,
                        DRhoDy.segment(i, cur));
                compute(PhiD, DPhiDz.middleRows(i, cur), 2.f,
                        DRhoDz.segment(i, cur));
            }
        }
    }
}
Interestingly, this doesn't produce a huge benefit on my system, only a factor of 1.25 with 8 cores / 16 threads. I have not investigated what the actual bottleneck is; I guess it's my main memory bandwidth. A system with lower per-core bandwidth and/or higher per-node bandwidth (Xeons, Threadrippers) may benefit more.
One last proposal, though it is situational: transpose the Phi and DPhiDx/y/z matrices. This allows two further optimizations for column-major matrices such as those used by Eigen:
General matrix-matrix multiplications are fastest when they are written in the pattern A.transpose() * B. Transposing the elements in Phi allows us to write PhiD = D_sigma.transpose() * Phi.
Column-wise reductions are faster than row-wise ones, except for a very small number of columns such as in MatrixX4f.
namespace {

void compute(const Matrix& PhiD, const Eigen::Ref<const Matrix>& DPhi,
             float factor, Eigen::Ref<Vector> Rho)
{
    Rho = PhiD.cwiseProduct(DPhi).colwise().sum() * factor;
}

} /* namespace anonymous */

void computeFVarsSigma(const int DFAType, const Matrix& D_sigma,
                       const Matrix& Phi, const Matrix& DPhiDx, const Matrix& DPhiDy,
                       const Matrix& DPhiDz, Vector& Rho, Vector& DRhoDx, Vector& DRhoDy,
                       Vector& DRhoDz)
{
    const Eigen::Index n = Phi.cols(), blocksize = 384;
    Rho.resize(n);
    if(DFAType == 1)
        for(Vector* vec: {&DRhoDx, &DRhoDy, &DRhoDz})
            vec->resize(n);

#   pragma omp parallel
    {
        Matrix PhiD;
#       pragma omp for nowait
        for(Eigen::Index i = 0; i < n; i += blocksize) {
            const Eigen::Index cur = std::min(blocksize, n - i);
            PhiD.noalias() = D_sigma.transpose() * Phi.middleCols(i, cur);
            compute(PhiD, Phi.middleCols(i, cur), 1.f, Rho.segment(i, cur));
            if (DFAType == 1) {
                compute(PhiD, DPhiDx.middleCols(i, cur), 2.f,
                        DRhoDx.segment(i, cur));
                compute(PhiD, DPhiDy.middleCols(i, cur), 2.f,
                        DRhoDy.segment(i, cur));
                compute(PhiD, DPhiDz.middleCols(i, cur), 2.f,
                        DRhoDz.segment(i, cur));
            }
        }
    }
}
This brings another speedup by a factor of 1.14. I would assume some greater advantage if the inner dimension grows from 42 to something closer to 100 or 1000 and also if the bottleneck above is not so pronounced.
Improvement through decomposition
There is a neat trick you can apply for the (Phi * D_sigma).cwiseProduct(Phi).rowwise().sum() case:
Let p be a row vector of Phi, S be D_sigma and d be the scalar result for this one row. Then what we compute is
d = p * S * p'
If S is positive semidefinite, we can use an LDLT decomposition:
S = P' * L * D * L' * P
into the permutation matrix P, a lower triangular matrix L and a diagonal matrix D.
From this follows:
d = p * P' * L * D * L' * P * p'
d = (p * P') * (L * sqrt(D)) * (sqrt(D) * L') * (P * p')
d = ||(p * P') * (L * sqrt(D))||^2
The (p * P') part is a simple permutation of p's entries. The (L * sqrt(D)) is another fast and simple operation since D is just a diagonal matrix. The final multiplication of the (p * P') vector with the (L * sqrt(D)) matrix is also cheaper than before because L is a triangular matrix. So you can use Eigen's triangularView<Eigen::Lower> to save operations.
Since the decomposition may fail, you have to provide the original approach as a fall-back.
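Here is a rough, untested sketch of how that could look with Eigen, reusing the Matrix/Vector aliases from above. The helper name computeRhoViaLDLT is purely illustrative, and for brevity it folds the permutation and the triangular factor into one small dense matrix instead of using triangularView:
// Sketch only: returns false if D_sigma is not (numerically) positive semidefinite,
// in which case the caller should fall back to the original formula.
bool computeRhoViaLDLT(const Matrix& Phi, const Matrix& D_sigma, Vector& Rho)
{
    Eigen::LDLT<Matrix> ldlt(D_sigma);
    Vector diag = ldlt.vectorD();
    if (ldlt.info() != Eigen::Success || (diag.array() < 0.f).any())
        return false;

    // L * sqrt(D) is lower triangular and only 42x42 in your use case
    Matrix LsqrtD = ldlt.matrixL();
    LsqrtD = LsqrtD * diag.cwiseSqrt().asDiagonal();

    // Fold the permutation into the small factor: M = P' * (L * sqrt(D)),
    // so that d = ||p * M||^2 for every row p of Phi
    const Matrix M = ldlt.transpositionsP().transpose() * LsqrtD;

    Rho = (Phi * M).rowwise().squaredNorm();
    return true;
}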

Let M=402264, N=42. Then in your case the Phi*D_sigma product takes M*N² FMA operations, and the cwiseProduct with the sum takes M*N FMA operations. You can save some significant work if you compute Phi * D_sigma only once, but you need to actually evaluate the result, e.g.
Matrix PhiD = Phi * D_sigma; // DO NOT USE `auto` HERE!
Rho = PhiD.cwiseProduct(Phi).rowwise().sum();
if(...) // etc

Related

How to perform an elementwise addition in an Eigen sparse matrix

I have the following code where AMAT is currently a dense matrix. However, most of the elements are zero, so that essentially it is a sparse matrix. I understand block operations are not supported for Eigen sparse matrices. I am wondering how I can rewrite this code if I replace AMAT with a sparse matrix. BMAT is a 9x9 dense matrix and every 3x3 block of BMAT is added to specific blocks in AMAT. BMAT is calculated outside this loop.
for(j=0;j<5000;j++) {
    id1 = ids(0,j);
    id2 = ids(1,j);
    id3 = ids(2,j);
    AMAT.block<3,3>(id1*3,id1*3) = AMAT.block<3,3>(id1*3,id1*3) + BMAT.block<3,3>(0,0);
    AMAT.block<3,3>(id1*3,id2*3) = AMAT.block<3,3>(id1*3,id2*3) + BMAT.block<3,3>(0,3);
    AMAT.block<3,3>(id1*3,id3*3) = AMAT.block<3,3>(id1*3,id3*3) + BMAT.block<3,3>(0,6);
    AMAT.block<3,3>(id2*3,id1*3) = AMAT.block<3,3>(id2*3,id1*3) + BMAT.block<3,3>(3,0);
    AMAT.block<3,3>(id2*3,id2*3) = AMAT.block<3,3>(id2*3,id2*3) + BMAT.block<3,3>(3,3);
    AMAT.block<3,3>(id2*3,id3*3) = AMAT.block<3,3>(id2*3,id3*3) + BMAT.block<3,3>(3,6);
    AMAT.block<3,3>(id3*3,id1*3) = AMAT.block<3,3>(id3*3,id1*3) + BMAT.block<3,3>(6,0);
    AMAT.block<3,3>(id3*3,id2*3) = AMAT.block<3,3>(id3*3,id2*3) + BMAT.block<3,3>(6,3);
    AMAT.block<3,3>(id3*3,id3*3) = AMAT.block<3,3>(id3*3,id3*3) + BMAT.block<3,3>(6,6);
}
This could work (untested, and I don't know the actual types of your matrices). The idea is to write a custom iterator which provides the indices and values of every entry of AMAT and pass that to setFromTriplets (duplicate entries will be summed together). This will iterate twice through your index list, but unfortunately not exploit the block structure of AMAT. But it will execute in O(nnz) time.
#include <Eigen/SparseCore>

struct AMAT_constructor {
  struct AMAT_iterator {
    bool operator==(AMAT_iterator const& other) const {
      return j == other.j && k == other.k;
    }
    bool operator!=(AMAT_iterator const& other) const {
      return !(*this == other);
    }
    Eigen::Index operator-(AMAT_iterator const& other) const {
      return (j - other.j) * 81 + k - other.k;
    }
    AMAT_iterator const* operator->() const { return this; }
    AMAT_iterator const& operator*() const { return *this; }
    float value() const { return BMAT(k); }
    Eigen::Index row() const { return ids((k / 3) % 3, j) * 3 + k % 3; }
    Eigen::Index col() const { return ids(k / 27, j) * 3 + (k / 9) % 3; }
    AMAT_iterator& operator++() {
      if (++k == 81) {
        k = 0;
        ++j;
      }
      return *this;
    }
    Eigen::Index j, k;
    Eigen::Matrix3Xi const& ids;
    Eigen::Matrix<float, 9, 9> const& BMAT;
  };

  Eigen::Matrix3Xi const& ids;
  Eigen::Matrix<float, 9, 9> const& BMAT;

  AMAT_iterator begin() const { return AMAT_iterator{0, 0, ids, BMAT}; }
  AMAT_iterator end() const { return AMAT_iterator{ids.cols(), 0, ids, BMAT}; }
};

// use it like this:
Eigen::SparseMatrix<float> foo(Eigen::Matrix3Xi const& ids,
                               Eigen::Matrix<float, 9, 9> const& BMAT,
                               Eigen::Index sizeA) {
  Eigen::SparseMatrix<float> AMAT(sizeA, sizeA);
  AMAT_constructor Ac{ids, BMAT};
  AMAT.setFromTriplets(Ac.begin(), Ac.end());
  return AMAT;
}
As I understand the tutorial at https://eigen.tuxfamily.org/dox/group__TutorialBlockOperations.html, block operations are possible, but you need to know the rows and columns at compile time.

no matching function for call in rcpp

When using Rcpp, I create a function named rpois_rcpp and try to call it below in the genDataList function, but an error occurs saying:
"no matching function for call to 'cpprbinom',
candidate function not viable: no known conversion from 'arma::vec' (aka 'Col') to 'Rcpp::NumericVector' (aka 'Vector<14>') for 3rd argument
arma::vec cpprbinom(int n, double size, NumericVector prob).
Can someone help me? Thanks!
Here is my code:
//create a random matrix X with covariance matrix sigma
// [[Rcpp::export]]
arma::mat mvrnormArma(const int n, arma::vec mu, const int p, const double rho) {
arma::mat sigma(p, p, arma::fill::zeros);
for (int i = 0; i < sigma.n_rows; ++i) {
for (int j = 0; j < sigma.n_cols; ++j) {
sigma(i,j) = pow(rho, abs((i + 1) - (j + 1)));
}
}
int ncols = sigma.n_cols;
arma::mat Y = arma::randn(n, ncols);
return arma::repmat(mu, 1, n).t() + Y * arma::chol(sigma);
}
//create a vector sampled from poisson distribution with mean vector
//lambda
// [[Rcpp::export]]
arma::vec rpois_rcpp( NumericVector &lambda) {
int n= lambda.length();
unsigned int lambda_i = 0;
IntegerVector sim(n);
for (unsigned int i = 0; i < n; i++) {
sim[i] = R::rpois(lambda[lambda_i]);
// update lambda_i to match next realized value with correct mean
lambda_i++;
}
return as<arma::vec>(sim);
}
// create a vector sampled from binomial distribution with probability vector prob
// [[Rcpp::export]]
arma::vec cpprbinom(int n, double size, NumericVector prob) {
NumericVector v = no_init(n);
std::transform( prob.begin(), prob.end(), v.begin(), [=](double p){
return R::rbinom(size, p); });
return as<arma::vec>(v);}
// [[Rcpp::export]]
List genDataList(int n, arma::vec& mu, int p, double rho,
                 arma::vec& beta, const double SNR, const std::string& Test_case) {
arma::mat U, V, data, normData, Projection;
arma::vec s, y, means, noise;
data = mvrnormArma(n, mu, p, rho);
normData = arma::normalise(data,2,0);
arma::svd_econ(U,s,V,normData,"right");
Projection = V * trans(V);
beta = Projection * beta;
if(Test_case == "gaussian")
{
means=normData * beta;
y = means + arma::randn(n) * sqrt(arma::var(means) / SNR);}
else if (Test_case == "poisson")
{
means=exp(normData * beta);
y = rpois_rcpp(means);}
else
{
means=exp(normData * beta)/(1 + exp(normData * beta));
y = cpprbinom(n,1,means);}
List ret;
ret["data"] = data;
ret["normData"] = normData;
ret["V"] = V;
ret["beta"] = beta;
ret["y"] = y;
return ret;
}
Thanks for adding your code. When I tried to compile, I got the same error as you, but also an error for the line calling rpois_rcpp()
invalid initialization of reference to type 'Rcpp::NumericVector&'
Pretty much everything seems to be in arma, except the R bindings and calls to the R:: namespace, which take doubles, ints, etc. It seems the easiest thing to do (to my mind) is to just take arma::vec as arguments instead:
arma::vec rpois_rcpp( arma::vec &lambda) {
int n= lambda.n_elem;
and
arma::vec cpprbinom(int n, double size, arma::vec prob) {
You never utilize the fact that lambda and prob are Rcpp::NumericVectors specifically, you just use doubles from them, so this seems the easiest route to me. After those changes, your code compiles fine on my machine. I don't have any test cases to make sure they run as you'd expect, but I imagine they will.
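For reference, a minimal sketch of what the two functions might look like after that change (untested; the sampling logic is kept from your code, only the argument types change):
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::vec rpois_rcpp(arma::vec& lambda) {
  arma::vec sim(lambda.n_elem);
  for (arma::uword i = 0; i < lambda.n_elem; ++i)
    sim[i] = R::rpois(lambda[i]);     // R::rpois takes and returns plain doubles
  return sim;
}

// [[Rcpp::export]]
arma::vec cpprbinom(int n, double size, arma::vec prob) {
  arma::vec v(n);
  for (int i = 0; i < n; ++i)
    v[i] = R::rbinom(size, prob[i]);  // same idea as the std::transform version
  return v;
}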

Convert Eigen::SparseMatrix to cuSparse and vice versa

I am having trouble figuring out how to convert Eigen::SparseMatrix to cuSparse due to how little documentation and how few examples are online. For dense matrices, converting from Eigen to CUDA for cuBLAS is fairly straightforward:
Eigen::MatrixXd A = Eigen::MatrixXd::Identity(3,3);
double *d_A;
cudaMalloc(reinterpret_cast<void **>(&d_A), 3 * 3 * sizeof(double));
cudaMemcpy(d_A, A.data(), sizeof(double) * 3 * 3, cudaMemcpyHostToDevice);
// do cublas operations on d_A
How to do the equivalent for the sparse matrices?
std::vector<Eigen::Triplet<double>> trip;
trip.emplace_back(0, 0, 1);
trip.emplace_back(1, 1, 1);
trip.emplace_back(2, 2, 1);
Eigen::SparseMatrix<double> A(3, 3);
A.setFromTriplets(trip.begin(), trip.end());
double *d_A;
// cudaMalloc?
// cudaMemcpy? some conversion?
// do cusparse operations
Just in case people are interested, I figured it out. The tricky part is that Eigen's sparse matrix is in CSC format, whereas cuSparse uses CSR. Fortunately, the conversion can be done by simply transposing: the CSC arrays of a matrix are exactly the CSR arrays of its transpose.
void EigenSparseToCuSparseTranspose(
const Eigen::SparseMatrix<double> &mat, int *row, int *col, double *val)
{
const int num_non0 = mat.nonZeros();
const int num_outer = mat.cols() + 1;
cudaMemcpy(row,
mat.outerIndexPtr(),
sizeof(int) * num_outer,
cudaMemcpyHostToDevice);
cudaMemcpy(
col, mat.innerIndexPtr(), sizeof(int) * num_non0, cudaMemcpyHostToDevice);
cudaMemcpy(
val, mat.valuePtr(), sizeof(double) * num_non0, cudaMemcpyHostToDevice);
}
void CuSparseTransposeToEigenSparse(
const int *row,
const int *col,
const double *val,
const int num_non0,
const int mat_row,
const int mat_col,
Eigen::SparseMatrix<double> &mat)
{
std::vector<int> outer(mat_col + 1);
std::vector<int> inner(num_non0);
std::vector<double> value(num_non0);
cudaMemcpy(
outer.data(), row, sizeof(int) * (mat_col + 1), cudaMemcpyDeviceToHost);
cudaMemcpy(inner.data(), col, sizeof(int) * num_non0, cudaMemcpyDeviceToHost);
cudaMemcpy(
value.data(), val, sizeof(double) * num_non0, cudaMemcpyDeviceToHost);
Eigen::Map<Eigen::SparseMatrix<double>> mat_map(
mat_row, mat_col, num_non0, outer.data(), inner.data(), value.data());
mat = mat_map.eval();
}
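Just to round this off, a rough usage sketch for the device-side buffers (error checking omitted; d_row, d_col and d_val are illustrative names). Note that the raw-pointer copies above assume compressed storage, which setFromTriplets already gives you:
Eigen::SparseMatrix<double> A(3, 3);
A.setFromTriplets(trip.begin(), trip.end());  // trip as in the question
A.makeCompressed();                           // defensive; ensures compressed storage

int *d_row, *d_col;
double *d_val;
cudaMalloc(reinterpret_cast<void **>(&d_row), sizeof(int) * (A.cols() + 1));
cudaMalloc(reinterpret_cast<void **>(&d_col), sizeof(int) * A.nonZeros());
cudaMalloc(reinterpret_cast<void **>(&d_val), sizeof(double) * A.nonZeros());

EigenSparseToCuSparseTranspose(A, d_row, d_col, d_val);
// ... cuSPARSE operations on (d_row, d_col, d_val), which describe A^T in CSR ...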

Cuda Matrix Example Block Size

I just started learning CUDA and I have been looking at examples on NVIDIA's website. Specifically, I have implemented the non-shared version of the matrix multiply (the first sample is the non-shared version even though it is in the shared memory section):
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory
I am having a problem with the output when I change the block sizes. NVIDIA's code has a default block size of 16 and this gives me the correct output when I multiply two matrices. However, if I change the block size to anything above 16 (while still being a multiple of 16), I get an output of zero for all elements in the matrix. I tested this on my laptop too and noticed the same results for anything over 32 rather than 16. Could someone explain what is happening? I have two 9800GTX+ video cards in SLI and so I should have a maximum block size of (512,512,1). Why can I only do 16?
Also, I am noticing the same behavior in the shared version of the matrix multiplication (also on the NVIDIA page).
I didn't post my code because I get the same problem if I directly copy the code from the NVIDIA site.
I would really appreciate any help with this or with resources to learn more about these kinds of CUDA details.
Thank you!
I have attached the code as requested:
#include "stdio.h"
#include <cuda.h>
#include <assert.h>
#include <time.h>
#include <math.h>
// This is an example CUDA program that compares the timings of a matrix multiplication.
// The comparisons are between the CPU, GPU, and the GPU with shared memory.
#define BLOCK_SIZE 32
typedef struct {
int width;
int height;
int stride;
float* elements;
} Matrix;
typedef void (*FuncPtr)(Matrix& A, Matrix& B, Matrix& C);
void multiplyMatrix(Matrix& A, Matrix& B, Matrix& C);
// Helper declarations
void initializeMatrix(Matrix& A, int rows, int cols, float val);
void copyMatrix(Matrix& dest, Matrix& src);
void freeMatrix(Matrix& A);
void printError(cudaError_t err);
void printMat(Matrix& A);
void setVal(Matrix& A, float val);
double applyMultFunc(FuncPtr func, Matrix& A, Matrix& B, Matrix& C, int numOfIters);
// CUDA declarations
__global__ void cudaMultMat(Matrix A, Matrix B, Matrix C);
int main() {
printf("Beginning Matrix Multiplication Comparison\n");
// Initialize matrix
Matrix A, B, C;
int rowsA = 32;
int colsA = 32;
int colsB = 32;
initializeMatrix(A, rowsA, colsA, 5.0f);
initializeMatrix(B, colsA, colsB, 2.0f);
initializeMatrix(C, rowsA, colsB, 0.0f);
// C = A * B using CPU, GPU, and GPU with shared memory
FuncPtr gpuMatMult = &multiplyMatrix;
int numOfIterations = 100;
double multTime = applyMultFunc(gpuMatMult, A, B, C, numOfIterations);
printMat(C);
// Update user
printf("Normal Mat Mult Time: %f\n", multTime);
// Cleanup
freeMatrix(A);
freeMatrix(B);
freeMatrix(C);
printf("\nPress Enter to continue...\n");
getchar();
return 0;
}
void multiplyMatrix(Matrix& A, Matrix& B, Matrix& C) {
// Initialize device matrices
Matrix deviceA, deviceB, deviceC;
copyMatrix(deviceA, A);
copyMatrix(deviceB, B);
copyMatrix(deviceC, C);
// Initialize number of blocks and threads
dim3 numOfThreadsPerBlock(BLOCK_SIZE, BLOCK_SIZE);
int xSize = (C.width + numOfThreadsPerBlock.x - 1) / numOfThreadsPerBlock.x;
int ySize = (C.height + numOfThreadsPerBlock.y - 1) / numOfThreadsPerBlock.y;
dim3 numOfBlocks(xSize, ySize);
// Call CUDA kernel
cudaMultMat<<<numOfBlocks, numOfThreadsPerBlock>>>(deviceA, deviceB, deviceC);
printError(cudaThreadSynchronize());
printError(cudaMemcpy(C.elements, deviceC.elements, C.height * C.width * sizeof(float), cudaMemcpyDeviceToHost));
// Free cuda memory
printError(cudaFree(deviceA.elements));
printError(cudaFree(deviceB.elements));
printError(cudaFree(deviceC.elements));
}
// CUDA definitions
// GPU matrix multiplication (non-shared memory)
__global__ void cudaMultMat(Matrix A, Matrix B, Matrix C) {
// If the matrices are of the wrong size then return
if(A.width != B.height) {
return;
}
// Initialize the indexes into the grid
int col = (blockDim.x * blockIdx.x) + threadIdx.x;
int row = (blockDim.y * blockIdx.y) + threadIdx.y;
// Initialize the result
float cVal = 0.0f;
// Find the result for the dot product of a row of A and a column of B
for(int i = 0; i < A.width; i++) {
cVal += A.elements[row * A.width + i] * B.elements[i * B.width + col];
}
// If we are in bounds then save the result
if(row < C.height && col < C.width) {
C.elements[row * C.width + col] = cVal;
}
}
// Helper functions
void initializeMatrix(Matrix& A, int rows, int cols, float val) {
A.width = cols;
A.height = rows;
A.stride = A.width;
int numOfElements = A.width * A.height;
A.elements = (float*) malloc(numOfElements * sizeof(float));
for(int i = 0; i < numOfElements; i++) {
A.elements[i] = val;
}
}
void copyMatrix(Matrix& dest, Matrix& src) {
dest.width = src.width;
dest.height = src.height;
dest.stride = src.stride;
int size = src.width * src.height * sizeof(float);
printError(cudaMalloc(&dest.elements, size));
printError(cudaMemcpy(dest.elements, src.elements, size, cudaMemcpyHostToDevice));
}
void freeMatrix(Matrix& A) {
free(A.elements);
}
void printError(cudaError_t err) {
if(err != 0) {
printf("CUDA ERROR: %s\n", cudaGetErrorString(err));
getchar();
}
}
void printMat(Matrix& A) {
printf("*********************************\n");
for(int i = 0; i < A.height; i++) {
for(int j = 0; j < A.width; j++) {
int index = i * A.width + j;
printf("%2.1f, ", A.elements[index]);
}
printf("\n");
}
}
void setVal(Matrix& A, float val) {
for(int i = 0; i < A.width * A.height; i++) {
A.elements[i] = val;
}
}
double applyMultFunc(FuncPtr func, Matrix& A, Matrix& B, Matrix& C, int numOfIters) {
clock_t startTime = clock();
for(int i = 0; i < numOfIters; i++) {
func(A, B, C);
}
clock_t endTime = clock();
return (double) (endTime - startTime) / CLOCKS_PER_SEC;
}
You're exceeding the threads per block specification of your GPU when you increase the block sizes.
The 9800GTX has a limit of 512 threads per block, regardless of how you create the block. 16*16 = 256 which is OK. 32 x 32 = 1024 which is not OK. In this case the kernel fails to run and so the output is not correct.
Your laptop probably has a newer GPU which supports 1024 threads per block, so 32 x 32 is OK but anything larger is not.
If you add proper CUDA error checking to the code you can confirm this. Note that this code appears to have CUDA error checking, but the checking implemented on the kernel call is incomplete. Study the link I gave and you will see the difference. If you modify the code with complete error checking, you will see the error.
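For illustration, a sketch of the usual pattern around the kernel launch: check the launch error separately from the execution error. An oversized block then shows up as an invalid-configuration error right after the <<<...>>> call.
cudaMultMat<<<numOfBlocks, numOfThreadsPerBlock>>>(deviceA, deviceB, deviceC);

cudaError_t launchErr = cudaGetLastError();      // launch/configuration errors appear here
if (launchErr != cudaSuccess) {
    printf("Kernel launch failed: %s\n", cudaGetErrorString(launchErr));
}

cudaError_t syncErr = cudaDeviceSynchronize();   // errors during kernel execution appear here
if (syncErr != cudaSuccess) {
    printf("Kernel execution failed: %s\n", cudaGetErrorString(syncErr));
}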
If your GPU's compute capability is 1.0/1.1, you can have at most 512 threads per block. But on newer GPU devices, every block can have at most 1024 threads.
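To check the actual limit of the device you are running on, you can query its properties (a small sketch):
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);  // device 0
printf("Compute capability:    %d.%d\n", prop.major, prop.minor);
printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);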

Multiplying with parenthesis result using overloaded * operator in C++

So I've been writing a small math library in C++, and when dealing with a scalar multiplied by a vector I get some issues when I try to perform this operation:
Vect V2;
Vect V3;
float S;
Vect V1 = V2 + S * (V2 - V3);
The Vect value I receive in the overloaded operator * is a new Vect object and not the outcome of the (V2 - V3) part of the operation. Here's the other relevant part of the code. If I follow the operation in the debugger, the two overloaded operators work correctly by themselves, but not one after the other.
Vect.h
Vect &operator - (const Vect &inVect);
friend const Vect operator*(const float scale, const Vect &inVect);
Vect.cpp
Vect &Vect::operator - (const Vect &inVect)
{
return Vect (this->x - inVect.x,this->y - inVect.y,this->z - inVect.z, 1);
}
const Vect operator*(const float scale, const Vect &inVect)
{
    Vect result;
    result.x = inVect.x * scale;
    result.y = inVect.y * scale;
    result.z = inVect.z * scale;
    result.w = 1;
    return result;
}
I also overloaded the + and = operator and they work as expected the only problem I encounter is the problem above.
In your operator-, you create a temporary Vect and return a reference to that temporary. The temporary is destroyed at the end of the return statement and the returned reference is left dangling. Doing anything with this reference will result in undefined behaviour.
Instead, your operator- should return a Vect by value:
Vect Vect::operator - (const Vect &inVect)
{
return Vect (this->x - inVect.x,this->y - inVect.y,this->z - inVect.z, 1);
}