Min of array rows in CUDA - c++

Given an n-by-m matrix, I would like to build an n-sized vector containing the minimum of each matrix row, in CUDA.
So far I've come up with this:
__global__ void OnMin(float * Mins, const float * Matrix, const int n, const int m) {
int i = threadIdx.x + blockDim.x * blockIdx.x;
if (i < n) {
Mins[i] = Matrix[m * i];
for (int j = 1; j < m; ++j){
if (Matrix[m * i + j] < Mins[i])
Mins[i] = Matrix[m * i + j];
}
}
}
called in:
OnMin<<<(n + TPB - 1) / TPB, TPB>>>(Mins, Matrix, n, m);
However I think that something more optimized could exist.
I tried invoking cublasIsamin in a loop, but it is slower.
I also tried launching a kernel (global) from OnMin kernel without success... (sm_35, compute_35 raises compile errors... I have a GTX670)
Any ideas ?
Thanks!

Finding the minimum of each row in a row-major matrix is a parallel reduction problem that has been discussed many times on Stack Overflow. For example, see this one:
Reduce matrix rows with CUDA
The basic idea is to use n blocks in a grid. Each block contains a fixed number of threads, typically 256. Each block of threads will do the parallel reduction on a row of the m elements to find the minimum collaboratively.
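For a matrix large enough to fully utilize the GPU, the best achievable time is about half the time of copying the matrix once on the device: a copy both reads and writes every element, whereas the reduction only has to read each element once.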
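A minimal sketch of that scheme, written as plain C++ that emulates on the CPU what one block would do for its row, so it can be run without a GPU. The names `row_min_block`, `row_mins` and `kThreads` are illustrative and not taken from the linked answer. In a real kernel the loops over `t` would be the threads of the block, `partial` would live in `__shared__` memory, and each halving round would be separated by `__syncthreads()`.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical block size; must be a power of two for the halving loop below.
constexpr std::size_t kThreads = 256;

// Emulates one thread block reducing one row of length m to its minimum.
float row_min_block(const float* row, std::size_t m) {
    // Stands in for a __shared__ array with one slot per thread.
    std::vector<float> partial(kThreads, std::numeric_limits<float>::infinity());

    // Step 1: thread t scans elements t, t + kThreads, t + 2 * kThreads, ...
    // Neighbouring threads touch neighbouring addresses, so loads coalesce.
    for (std::size_t t = 0; t < kThreads; ++t)
        for (std::size_t j = t; j < m; j += kThreads)
            partial[t] = std::min(partial[t], row[j]);

    // Step 2: tree reduction; the number of active threads halves each round.
    for (std::size_t stride = kThreads / 2; stride > 0; stride /= 2)
        for (std::size_t t = 0; t < stride; ++t)
            partial[t] = std::min(partial[t], partial[t + stride]);

    return partial[0];
}

// One "block" per row of a row-major n-by-m matrix.
std::vector<float> row_mins(const std::vector<float>& matrix,
                            std::size_t n, std::size_t m) {
    std::vector<float> mins(n);
    for (std::size_t i = 0; i < n; ++i)
        mins[i] = row_min_block(matrix.data() + i * m, m);
    return mins;
}
```

Compared with the one-thread-per-row kernel in the question, the work for a single row is spread over the whole block, and consecutive threads read consecutive elements of that row.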

Related

understanding Matrix multiplication in CUDA

I am trying to learn CUDA. I started to try matrix multiplication with the help of this article based on GPU.
My main problem is that I am unable to understand how to access a 2D array in the kernel, since accessing it there is a bit different from the conventional method (matrix[i][j]).
This is the part where I am stuck:
for (int i = 0; i < N; i++) {
tmpSum += A[ROW * N + i] * B[i * N + COL];
}
C[ROW * N + COL] = tmpSum;
I could understand how ROW and COLUMN were derived.
int ROW = blockIdx.y*blockDim.y+threadIdx.y;
int COL = blockIdx.x*blockDim.x+threadIdx.x;
Any explanation with an example is highly appreciated. Thanks!
Matrices are stored contiguously, i.e. row after row at consecutive memory locations. What you see here is called flat addressing, i.e. turning the two-element index (row, column) into an offset from the first element: element (ROW, COL) of a matrix with N columns sits at offset ROW * N + COL.
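As a small illustration, here is a CPU-only C++ sketch of the same indexing. The helper name `flat_index` is made up for this example; `matmul` just repeats the loop from the question for every (row, col) pair instead of assigning one pair per thread. For N = 4, element (1, 2) ends up at offset 1 * 4 + 2 = 6.

```cpp
#include <cstddef>
#include <vector>

// Offset of element (row, col) in a row-major matrix with `width` columns.
inline std::size_t flat_index(std::size_t row, std::size_t col, std::size_t width) {
    return row * width + col;
}

// Same inner loop as in the question, for square N-by-N row-major matrices.
std::vector<float> matmul(const std::vector<float>& A,
                          const std::vector<float>& B, std::size_t N) {
    std::vector<float> C(N * N, 0.0f);
    for (std::size_t row = 0; row < N; ++row)
        for (std::size_t col = 0; col < N; ++col) {
            float tmpSum = 0.0f;
            // Walk along row `row` of A and down column `col` of B.
            for (std::size_t i = 0; i < N; ++i)
                tmpSum += A[flat_index(row, i, N)] * B[flat_index(i, col, N)];
            C[flat_index(row, col, N)] = tmpSum;
        }
    return C;
}
```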

Is Eigen library matrix/vector manipulation faster than .net ones if the matrix is dense and unsymmetrical?

I have some matrix operations, mostly running over all the rows and columns of the matrix and performing multiplications of the form a*mat[i,j]*mat[ii,j]:
public double[] MaxSumFunction()
{
var maxSum= new double[matrix.GetLength(1)];
for (int j = 0; j < matrix.GetLength(1); j++)
{
for (int i = 0; i < matrix.GetLength(0); i++)
{
for (int ii = 0; ii < matrix.GetLength(0); ii++)
{
double wi= Math.Sqrt(vector[i]);
double wii= Math.Sqrt(vector[ii]);
maxSum[j] += SomePowerFunctions(wi, wii) * matrix[i, j]*matrix[ii, j];
}
}
}
return maxSum;
}
private double SomePowerFunctions(double wi, double wj)
{
var betaij = wi/ wj;
var numerator = 8 * Math.Sqrt(wi* wj) * Math.Pow(betaij, 3.0 / 2)
* (wi+ betaij * wj);
var dominator = Math.Pow(1 - betaij * betaij, 2) +
4 * wi* wj* betaij * (1 + Math.Pow(betaij, 2)) +
4 * (wi* wi+ wj* wj) * Math.Pow(betaij, 2);
if (wi== 0 && wj== 0)
{
if (Math.Abs(betaij - 1) < 1.0e-8)
return 1;
else
return 0;
}
return numerator / dominator;
}
I found such loops to be particularly slow when the matrix is large.
I want this to run fast, so I am thinking about re-implementing these algorithms using the Eigen library.
My matrix is not symmetrical, not sparse and contains no regularity that any solver can exploit reliably.
I read that Eigen solver can be fast because of:
Compiler optimization
Vectorization
Multi-thread support
But I wonder whether those advantages are really applicable given my matrix characteristics?
Note: I could have just run a sample or two to find out, but I believe that asking the question here and have it documented on the Internet is going to help others as well.
Before thinking about low-level optimizations, look at your code and observe that many quantities are recomputed many times. For instance, f(wi,wii) does not depend on j, so it could either be precomputed once (see below) or you can rewrite your loops to make the loop on j the innermost one. Then the innermost loop is simply a coefficient-wise product between a constant scalar and two vectors taken from your matrix along the j direction (I don't know .NET and assume j is indexing columns). If the storage order makes these vectors contiguous in memory, then this operation should be fully vectorized by your compiler (again, I don't know .NET, but any C++ compiler will do, and if you use Eigen, it will be vectorized explicitly). This should be enough to get a huge performance boost.
Depending on the sizes of the matrix, you might also try to leverage an optimized matrix-matrix product by precomputing f(wi,wii) into a MatrixXd F (using Eigen's language), and then observing that the whole computation amounts to:
VectorXd v = your_vector;
MatrixXd F = MatrixXd::NullaryExpr(n,n,[&](Index i,Index j) {
return SomePowerFunctions(sqrt(v(i)), sqrt(v(j)));
});
MatrixXd M = your_matrix;
MatrixXd FM = F * M;
VectorXd maxSum = (M.array() * FM.array()).colwise().sum();
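To see why the factored form gives the same numbers, here is a plain C++ sketch without Eigen that compares the triple loop from the question with the "F * M, then column-wise sum" version. The names `Mat`, `max_sum_naive` and `max_sum_factored` are made up for this check, and F is assumed to be already tabulated, with F[i][ii] = f(wi, wii).

```cpp
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>;  // Mat[i][j], i = row, j = column

// Reference: the triple loop from the question, with f already tabulated in F.
std::vector<double> max_sum_naive(const Mat& F, const Mat& M) {
    const std::size_t n = M.size(), cols = M[0].size();
    std::vector<double> out(cols, 0.0);
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t ii = 0; ii < n; ++ii)
                out[j] += F[i][ii] * M[i][j] * M[ii][j];
    return out;
}

// Factored form: FM = F * M, then out[j] = sum over i of M[i][j] * FM[i][j].
std::vector<double> max_sum_factored(const Mat& F, const Mat& M) {
    const std::size_t n = M.size(), cols = M[0].size();
    Mat FM(n, std::vector<double>(cols, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t ii = 0; ii < n; ++ii)
            for (std::size_t j = 0; j < cols; ++j)  // j innermost: contiguous access
                FM[i][j] += F[i][ii] * M[ii][j];
    std::vector<double> out(cols, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            out[j] += M[i][j] * FM[i][j];
    return out;
}
```

Both versions do the same amount of arithmetic; the point of the second one is that the expensive part becomes an ordinary matrix product, which a library such as Eigen can hand to its optimized kernels.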

How to optimize this math operation for speed

I'm trying to optimize a function taking a good chunk of execution time, which computes the following math operation many times. Is there any way to make this operation faster?
float total = (sqrt(
((point_A[j].length)*(point_A[j].length))+
((point_B[j].width)*(point_B[j].width))+
((point_C[j].height)*(point_C[j].height))
));
If memory is cheap, you could do the following to improve the CPU cache hit rate. Since you haven't posted more details, I will make some assumptions here.
long tmp_len_square[N*3];
for (int j = 0; j < N; ++j) {
tmp_len_square[3 * j] = (point_A[j].length)*(point_A[j].length);
}
for (int j = 0; j < N; ++j) {
tmp_len_square[(3 * j) + 1] = (point_B[j].width)*(point_B[j].width);
}
for (int j = 0; j < N; ++j) {
tmp_len_square[(3 * j) + 2] = (point_C[j].height)*(point_C[j].height);
}
for (int j = 0; j < N; ++j) {
float total = sqrt(tmp_len_square[3 * j] +
tmp_len_square[(3 * j) + 1] +
tmp_len_square[(3 * j) + 2]);
// ...
}
Rearrange the data into this:
float *pointA_length;
float *pointB_width;
float *pointC_height;
That may require some level of butchering of your data structures, so you'll have to choose whether it's worth it or not.
Now what we can do is write this:
void process_points(float* Alengths, float* Bwidths, float* Cheights,
float* output, int n)
{
for (int i = 0; i < n; i++) {
output[i] = sqrt(Alengths[i] * Alengths[i] +
Bwidths[i] * Bwidths[i] +
Cheights[i] * Cheights[i]);
}
}
Writing it like this allows it to be auto-vectorized. For example, GCC targeting AVX and with -fno-math-errno -ftree-vectorize, can vectorize that loop. It does that with a lot of cruft though. __restrict__ and alignment attributes only improve that a little. So here's a hand-vectorized version as well: (not tested)
void process_points(float* Alengths,
float* Bwidths,
float* Cheights,
float* output, int n)
{
for (int i = 0; i < n; i += 8) {
__m256 a = _mm256_load_ps(Alengths + i);
__m256 b = _mm256_load_ps(Bwidths + i);
__m256 c = _mm256_load_ps(Cheights + i);
__m256 asq = _mm256_mul_ps(a, a);
__m256 sum = _mm256_fmadd_ps(c, c, _mm256_fmadd_ps(b, b, asq));
__m256 hsum = _mm256_mul_ps(sum, _mm256_set1_ps(0.5f));
__m256 invsqrt = _mm256_rsqrt_ps(sum);
__m256 s = _mm256_mul_ps(invsqrt, invsqrt);
invsqrt = _mm256_mul_ps(invsqrt, _mm256_fnmadd_ps(hsum, s, _mm256_set1_ps(1.5f)));
_mm256_store_ps(output + i, _mm256_mul_ps(sum, invsqrt));
}
}
This makes a number of assumptions:
all pointers are 32-byte aligned.
n is a multiple of 8, or at least the buffers have enough padding that they're never accessed out of bounds.
the input buffers are not aliased with the output buffer (they could be aliased among themselves, but .. why)
the slightly reduced accuracy of the square root computed this way is OK (accurate to approximately 22 bits, instead of correctly rounded).
the sum of squares computed with fmadd can be slightly different than if it's computed using multiplies and adds, I assume that's OK too
your target supports AVX/FMA so this will actually run
The method for computing the square root I used here is using an approximate reciprocal square root, an improvement step (y = y * (1.5 - (0.5 * x * y * y))) and then a multiplication by x because x * 1/sqrt(x) = x/sqrt(x) = sqrt(x).
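For reference, the same refinement written for a single float in plain C++, as a sketch only. The function names are invented here, and the rough estimate `y0` is passed in by the caller, standing in for one lane of the `_mm256_rsqrt_ps` result. Note that x = 0 needs separate care, because 1/sqrt(0) is infinite.

```cpp
#include <cmath>

// One Newton-Raphson step for y ~ 1/sqrt(x): y' = y * (1.5 - 0.5 * x * y * y).
inline float refine_rsqrt(float x, float y) {
    return y * (1.5f - 0.5f * x * y * y);
}

// sqrt(x) from a rough reciprocal-square-root estimate y0, for x > 0:
// x * (1 / sqrt(x)) = sqrt(x).
inline float sqrt_from_rsqrt(float x, float y0) {
    return x * refine_rsqrt(x, y0);
}
```

For example, with x = 4 and a deliberately poor estimate y0 = 0.49, the unrefined product x * y0 is 1.96, while the refined result is within about 0.002 of 2.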
You can also try to optimize the sqrt function itself. I suggest having a look at this link:
Best Square Root Method
Your question could be improved by adding a little more context. Is your code required to be portable, or are you targeting a particular compiler, or a specific processor or processor family? Perhaps you're willing to accept a general baseline version with target-specific optimised versions selected at runtime?
Also, there's very little context for the line of code you give. Is it in a tight loop? Or is it scattered in a bunch of places in conditional code in such a loop?
I'm going to assume that it's in a tight loop thus:
for (int j=0; j<total; ++j)
length[j] = sqrt(
(point_A[j].length)*(point_A[j].length) +
(point_B[j].width)*(point_B[j].width) +
(point_C[j].height)*(point_C[j].height));
I'm also going to assume that your target processor is multi-core, and that the arrays are distinct (or that the relevant elements are distinct), then an easy win is to annotate for OpenMP:
#pragma omp parallel for
for (int j=0; j<total; ++j)
length[j] = sqrt((point_A[j].length)*(point_A[j].length) +
(point_B[j].width)*(point_B[j].width) +
(point_C[j].height)*(point_C[j].height));
Compile with g++ -O3 -fopenmp -march=native (or substitute native with your desired target processor architecture).
If you know your target, you might be able to benefit from parallelisation of loops with the gcc flag -ftree-parallelize-loops=n - look in the manual.
Now measure your performance change (I'm assuming that you measured the original, given that this is an optimisation question). If it's still not fast enough for you, then it's time to consider changing your data structures, algorithms, or individual lines of code.

Advice on CUDA algorithm to sum columns of a matrix [duplicate]

Windows 7, NVidia GeForce 425M.
I wrote a simple CUDA code which calculates the row sums of a matrix.
The matrix has uni-dimensional representation (pointer to a float).
The serial version of code is below (it has 2 loops, as expected):
void serial_rowSum (float* m, float* output, int nrow, int ncol) {
float sum;
for (int i = 0 ; i < nrow ; i++) {
sum = 0;
for (int j = 0 ; j < ncol ; j++)
sum += m[i*ncol+j];
output[i] = sum;
}
}
Inside the CUDA code, I call the kernel function sweeping the matrix by rows. Below, the kernel call snippet:
dim3 threadsPerBlock((unsigned int) nThreadsPerBlock); // has to be multiple of 32
dim3 blocksPerGrid((unsigned int) ceil(nrow/(float) nThreadsPerBlock));
kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);
and the kernel function which performs the parallel sum of the rows (still has 1 loop):
__global__ void kernel_rowSum(float *m, float *s, int nrow, int ncol) {
int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
if (rowIdx < nrow) {
float sum=0;
for (int k = 0 ; k < ncol ; k++)
sum+=m[rowIdx*ncol+k];
s[rowIdx] = sum;
}
}
So far so good. The serial and parallel (CUDA) results are equal.
The whole point is that the CUDA version takes almost twice the time of the serial one to compute, even if I change the nThreadsPerBlock parameter: I tested nThreadsPerBlock from 32 to 1024 (maximum number of threads per block allowed for my card).
IMO, the matrix dimension is big enough to justify parallelization: 90,000 x 1,000.
Below, I report the time elapsed for the serial and parallel versions using different nThreadsPerBlock. Time reported in msec over an average of 100 samples:
Matrix: nrow = 90000 x ncol = 1000
Serial: Average Time Elapsed per Sample in msec (100 samples): 289.18.
CUDA (32 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 497.11.
CUDA (1024 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 699.66.
Just in case, the version with 32/1024 nThreadsPerBlock is the fastest/slowest one.
I understand that there is a kind of overhead when copying from Host to Device and the other way around, but maybe the slowness is because I am not implementing the fastest code.
Since I am far from being a CUDA expert:
Am I coding the fastest version for this task? How could I improve my code?
Can I get rid of the loop in the kernel function?
Any thoughts appreciated.
EDIT 1
Although I describe a standard rowSum, I am actually interested in AND/OR operations on rows whose values are in {0;1}, like rowAND/rowOR. That rules out the cuBLAS trick of multiplying by a column vector of ones, as suggested by some commenters.
EDIT 2
As suggested by other users and endorsed here:
FORGET ABOUT TRYING TO WRITE YOUR OWN FUNCTIONS, use Thrust library instead and the magic comes.
Since you mentioned that you need a general reduction algorithm and not only a sum, I will try to give three approaches here. The kernel approach may have the highest performance. The Thrust approach is the easiest to implement. The cuBLAS approach works only with sums and has good performance.
Kernel Approach
Here's a very good doc introducing how to optimize standard parallel reduction. Standard reduction can be divided into 2 stages.
Multiple thread blocks each reduces one part of the data;
One thread block reduces from result of stage 1 to the final 1 element.
For your multi-reduction problem (reducing the rows of a matrix), stage 1 alone is enough. The idea is to reduce 1 row per thread block. For further considerations, like multiple rows per thread block or 1 row per multiple thread blocks, you can refer to the paper provided by @Novak. This may improve the performance further, especially for matrices with awkward shapes.
Thrust Approach
General multi-reduction can be done with thrust::reduce_by_key in a few minutes. You can find some discussion here: Determining the least element and its position in each matrix column with CUDA Thrust.
However, thrust::reduce_by_key does not assume that each row has the same length, so you will pay a performance penalty. Another post, How to normalize matrix columns in CUDA with max performance?, gives a profiling comparison between thrust::reduce_by_key and the cuBLAS approach for row sums. It may give you a basic understanding of the performance.
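To make the by-key idea concrete, here is a CPU-only C++ sketch of its semantics, not of Thrust's implementation. The function name `reduce_rows_by_key` is made up; it assumes a row-major matrix with `ncols > 0`, gives element i the key i / ncols (its row index), and folds runs of equal keys with an arbitrary binary operator, which is what makes rowAND/rowOR possible.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Folds consecutive elements that share the key i / ncols (the row index)
// with a user-supplied binary operator. Requires ncols > 0.
template <typename T, typename Op>
std::vector<T> reduce_rows_by_key(const std::vector<T>& values,
                                  std::size_t ncols, Op op) {
    std::vector<T> out;
    for (std::size_t i = 0; i < values.size(); ++i) {
        const std::size_t key = i / ncols;
        if (key == out.size())
            out.push_back(values[i]);                // first element of a new run
        else
            out.back() = op(out.back(), values[i]);  // same key: keep folding
    }
    return out;
}
```

With a 2 x 3 matrix of 0/1 values {1,1,0, 1,1,1}, `std::bit_and<int>()` gives the row-wise AND, `std::bit_or<int>()` the row-wise OR, and `std::plus<int>()` the row sums.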
cuBLAS Approach
The sum of the rows/columns of a matrix A can be seen as a matrix-vector multiplication where the elements of the vector are all ones. It can be represented by the following MATLAB code:
y = A * ones(size(A,2),1);
where y is the sum of rows of A.
The cuBLAS library provides a high-performance matrix-vector multiplication function, cublas<t>gemv(), for this operation.
Timing results show that this routine is only 10~50% slower than simply reading all the elements of A once, which can be seen as the theoretical upper limit of the performance for this operation.
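The identity itself can be checked on the CPU with a naive reference gemv, sketched below in plain C++. The names `gemv` and `row_sums` are chosen for this example and this is not the cuBLAS API. The matrix is assumed to be row-major here; cuBLAS works with column-major storage, so a real call would have to account for that, typically with a transpose flag.

```cpp
#include <cstddef>
#include <vector>

// y = A * x for a row-major nrow-by-ncol matrix A (naive reference gemv).
std::vector<float> gemv(const std::vector<float>& A, const std::vector<float>& x,
                        std::size_t nrow, std::size_t ncol) {
    std::vector<float> y(nrow, 0.0f);
    for (std::size_t i = 0; i < nrow; ++i)
        for (std::size_t j = 0; j < ncol; ++j)
            y[i] += A[i * ncol + j] * x[j];
    return y;
}

// Row sums expressed as a matrix-vector product with a vector of ones.
std::vector<float> row_sums(const std::vector<float>& A,
                            std::size_t nrow, std::size_t ncol) {
    return gemv(A, std::vector<float>(ncol, 1.0f), nrow, ncol);
}
```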
Reducing the rows of a matrix can be solved by using CUDA Thrust in three ways (they may not be the only ones, but addressing this point is out of scope). As also recognized by the OP, using CUDA Thrust is preferable for this kind of problem. An approach using cuBLAS is also possible.
APPROACH #1 - reduce_by_key
This is the approach suggested at this Thrust example page. It includes a variant using make_discard_iterator.
APPROACH #2 - transform
This is the approach suggested by Robert Crovella at CUDA Thrust: reduce_by_key on only some values in an array, based off values in a “key” array.
APPROACH #3 - inclusive_scan_by_key
This is the approach suggested by Eric at How to normalize matrix columns in CUDA with max performance?.
APPROACH #4 - cublas<t>gemv
It uses cuBLAS gemv to multiply the relevant matrix by a column of 1's.
THE FULL CODE
Here is the code condensing the four approaches. The Utilities.cu and Utilities.cuh files, as well as the TimingGPU.cu and TimingGPU.cuh files, are maintained separately and omitted here.
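#include <cublas_v2.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <thrust/sequence.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <stdio.h>
#include <iostream>
#include "Utilities.cuh"
#include "TimingGPU.cuh"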
// --- Required for approach #2
__device__ float *vals;
/**************************************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX - NEEDED FOR APPROACH #1 */
/**************************************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {
T Ncols; // --- Number of columns
__host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}
__host__ __device__ T operator()(T i) { return i / Ncols; }
};
/******************************************/
/* ROW_REDUCTION - NEEDED FOR APPROACH #2 */
/******************************************/
struct row_reduction {
const int Ncols; // --- Number of columns
row_reduction(int _Ncols) : Ncols(_Ncols) {}
__device__ float operator()(float& x, int& y ) {
float temp = 0.f;
for (int i = 0; i<Ncols; i++)
temp += vals[i + (y*Ncols)];
return temp;
}
};
/**************************/
/* NEEDED FOR APPROACH #3 */
/**************************/
template<typename T>
struct MulC: public thrust::unary_function<T, T>
{
T C;
__host__ __device__ MulC(T c) : C(c) { }
__host__ __device__ T operator()(T x) { return x * C; }
};
/********/
/* MAIN */
/********/
int main()
{
const int Nrows = 5; // --- Number of rows
const int Ncols = 8; // --- Number of columns
// --- Random uniform integer distribution between 10 and 99
thrust::default_random_engine rng;
thrust::uniform_int_distribution<int> dist(10, 99);
// --- Matrix allocation and initialization
thrust::device_vector<float> d_matrix(Nrows * Ncols);
for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist(rng);
TimingGPU timerGPU;
/***************/
/* APPROACH #1 */
/***************/
timerGPU.StartCounter();
// --- Allocate space for row sums and indices
thrust::device_vector<float> d_row_sums(Nrows);
thrust::device_vector<int> d_row_indices(Nrows);
// --- Compute row sums by summing values with equal row indices
//thrust::reduce_by_key(thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)),
// thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
// d_matrix.begin(),
// d_row_indices.begin(),
// d_row_sums.begin(),
// thrust::equal_to<int>(),
// thrust::plus<float>());
thrust::reduce_by_key(
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
d_matrix.begin(),
thrust::make_discard_iterator(),
d_row_sums.begin());
printf("Timing for approach #1 = %f\n", timerGPU.GetCounter());
// --- Print result
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "] = " << d_row_sums[i] << "\n";
}
/***************/
/* APPROACH #2 */
/***************/
timerGPU.StartCounter();
thrust::device_vector<float> d_row_sums_2(Nrows, 0);
float *s_vals = thrust::raw_pointer_cast(&d_matrix[0]);
gpuErrchk(cudaMemcpyToSymbol(vals, &s_vals, sizeof(float *)));
thrust::transform(d_row_sums_2.begin(), d_row_sums_2.end(), thrust::counting_iterator<int>(0), d_row_sums_2.begin(), row_reduction(Ncols));
printf("Timing for approach #2 = %f\n", timerGPU.GetCounter());
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "] = " << d_row_sums_2[i] << "\n";
}
/***************/
/* APPROACH #3 */
/***************/
timerGPU.StartCounter();
thrust::device_vector<float> d_row_sums_3(Nrows, 0);
thrust::device_vector<float> d_temp(Nrows * Ncols);
thrust::inclusive_scan_by_key(
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
d_matrix.begin(),
d_temp.begin());
thrust::copy(
thrust::make_permutation_iterator(
d_temp.begin() + Ncols - 1,
thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))),
thrust::make_permutation_iterator(
d_temp.begin() + Ncols - 1,
thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))) + Nrows,
d_row_sums_3.begin());
printf("Timing for approach #3 = %f\n", timerGPU.GetCounter());
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "] = " << d_row_sums_3[i] << "\n";
}
/***************/
/* APPROACH #4 */
/***************/
cublasHandle_t handle;
timerGPU.StartCounter();
cublasSafeCall(cublasCreate(&handle));
thrust::device_vector<float> d_row_sums_4(Nrows);
thrust::device_vector<float> d_ones(Ncols, 1.f);
float alpha = 1.f;
float beta = 0.f;
cublasSafeCall(cublasSgemv(handle, CUBLAS_OP_T, Ncols, Nrows, &alpha, thrust::raw_pointer_cast(d_matrix.data()), Ncols,
thrust::raw_pointer_cast(d_ones.data()), 1, &beta, thrust::raw_pointer_cast(d_row_sums_4.data()), 1));
printf("Timing for approach #4 = %f\n", timerGPU.GetCounter());
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "] = " << d_row_sums_4[i] << "\n";
}
return 0;
}
TIMING RESULTS (tested on a Kepler K20c)
Matrix size    #1      #1-v2   #2      #3      #4      #4 (no plan)
100 x 100      0.63    1.00    0.10    0.18    139.4   0.098
1000 x 1000    1.25    1.12    3.25    1.04    101.3   0.12
5000 x 5000    8.38    15.3    16.05   13.8    111.3   1.14
100 x 5000     1.25    1.52    2.92    1.75    101.2   0.40
5000 x 100     1.35    1.99    0.37    1.74    139.2   0.14
It seems that approaches #1 and #3 outperform approach #2, except in the cases of small numbers of columns. The best approach, however, is approach #4, which is significantly more convenient than the others, provided that the time needed to create the plan can be amortized during the computation.
If this is the extent (summing the rows) of the operations you need to do with this data, I wouldn't expect a sizable benefit from the GPU. You have exactly one arithmetic operation per data element, and for that you are paying the cost of transferring that data element to the GPU. And beyond a certain problem size (whatever it takes to keep the machine busy) you get no added benefit from larger problem sizes, because the arithmetic intensity is O(n).
So this isn't a particularly exciting problem to solve on the GPU.
But as talonmies has indicated, you have a coalescing problem in the way you have crafted it, which will further slow things down. Let's take a look at a small example:
C1 C2 C3 C4
R1 11 12 13 14
R2 21 22 23 24
R3 31 32 33 34
R4 41 42 43 44
Above is a simple pictorial example of a small portion of your matrix. The machine data storage is such that elements (11), (12), (13), and (14) are stored in adjacent memory locations.
For coalesced access, we want an access pattern such that adjacent memory locations are requested from the same instruction, executed across the warp.
We need to think about execution of your code from the standpoint of a warp, that is 32 threads executing in lock-step. What is your code doing? Which elements is it retrieving (asking for) at each step/instruction? Let's take a look at this line of code:
sum+=m[rowIdx*ncol+k];
Adjacent threads in the warp have adjacent (i.e. consecutive) values for rowIdx as you have created that variable. So when k = 0, which data element is being asked for by each thread when we try to retrieve the value m[rowIdx*ncol+k] ?
In block 0, thread 0 has a rowIdx of 0. Thread 1 has a rowIdx of 1, etc. So the values being asked for by each thread at this instruction are:
Thread: Memory Location: Matrix Element:
0 m[0] (11)
1 m[ncol] (21)
2 m[2*ncol] (31)
3 m[3*ncol] (41)
But this is not coalesced access! Elements (11), (21), etc. are not adjacent in memory. For coalesced access, we would like the Matrix Element column to read like this:
Thread: Memory Location: Matrix Element:
0 m[?] (11)
1 m[?] (12)
2 m[?] (13)
3 m[?] (14)
If you then work backwards to determine what the value of ? should be, you will come up with an instruction something like this:
sum+=m[k*ncol+rowIdx];
This will give coalesced access, but it will not give you the correct answer, because we are now summing matrix columns instead of matrix rows. We can fix this by re-organizing your data storage to be in column-major order rather than row-major order. (You should be able to google that for ideas, right?) Conceptually, this is equivalent to transposing your matrix m. Whether this is convenient for you to do or not is outside the scope of your question, as I see it, and not really a CUDA issue. It may be a simple thing for you to do as you are creating the matrix on the host or transferring the matrix from host to device. But in summary, I don't know of a way to sum the matrix rows with 100% coalesced access, if the matrix is stored in row-major order. (You could resort to a sequence of row-reductions but that looks painful to me.)
It's not uncommon, when we are thinking about ways to accelerate code on the GPU, to consider re-organizing our data storage to facilitate the GPU. This is one example.
And, yes, what I'm outlining here still retains a loop in the kernel.
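The access patterns described above can be tabulated with a few lines of plain C++, as a CPU-only sketch. All function names are invented for this illustration. The column-major variant assumes the matrix has been stored transposed, so that element (r, k) lives at offset k * nrow + r.

```cpp
#include <cstddef>
#include <vector>

// Offsets requested by the first `threads` threads of a warp at loop step k
// when thread r sums row r of a ROW-major matrix: m[r * ncol + k].
std::vector<std::size_t> reads_row_major(std::size_t threads, std::size_t ncol,
                                         std::size_t k) {
    std::vector<std::size_t> offsets;
    for (std::size_t r = 0; r < threads; ++r)
        offsets.push_back(r * ncol + k);
    return offsets;
}

// Same step when the matrix is stored COLUMN-major: m[k * nrow + r].
std::vector<std::size_t> reads_col_major(std::size_t threads, std::size_t nrow,
                                         std::size_t k) {
    std::vector<std::size_t> offsets;
    for (std::size_t r = 0; r < threads; ++r)
        offsets.push_back(k * nrow + r);
    return offsets;
}

// Row sums over column-major storage: still one loop over k per "thread",
// but neighbouring values of r read neighbouring addresses at every step.
std::vector<float> row_sums_col_major(const std::vector<float>& m,
                                      std::size_t nrow, std::size_t ncol) {
    std::vector<float> s(nrow, 0.0f);
    for (std::size_t k = 0; k < ncol; ++k)
        for (std::size_t r = 0; r < nrow; ++r)  // r plays the role of rowIdx
            s[r] += m[k * nrow + r];
    return s;
}
```

With ncol = 1000, the row-major pattern at k = 0 touches offsets 0, 1000, 2000, 3000, i.e. a stride of ncol between neighbouring threads, while the column-major pattern touches 0, 1, 2, 3.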
As an additional comment, I would suggest timing the data copy portions, and kernel (compute) portions separately. I can't tell from your question whether you are timing just the kernel or the entire (GPU) operation, including the data copies. If you time the data copies separately, you may discover that just the data copy time exceeds your CPU time. Any effort put into optimizing your CUDA code will not affect the data copy time. This might be a useful data point before you spend much time on this.

Is something like this possible in CUDA

Let's say I have a matrix with values of 0 or 1. Is it possible in CUDA to do something like this:
__global__ void kernel(float *matrix, float *count)
{
int row = blockIdx.y * blockDim.y + threadIdx.y;
int column = blockIdx.x * blockDim.x + threadIdx.x;
if (row >= MATRIXSIZE || column >= MATRIXSIZE)
{
return;
}
if (matrix[MATRIXSIZE * row + column] == 1)
{
count[0]++;
}
}
So that in the end I get the number of ones in the matrix. I know this is a very simple example, but if this is possible, then other variants would be too ...
There are highly optimized libraries for CUDA that perform these types of operations, called reductions. Look into CUDA Thrust or CUB. In Thrust, you can use reduce to sum up all the values or count to count number of instances of a particular value.
If you really want to do it this way, you should use an atomic add, atomicAdd: atomicAdd(&count[0], 1.0f).
But this may cause performance issues, because every thread that finds a 1 contends for the same address.
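A rough CPU-only C++ sketch of the reduction-style alternative: each chunk of the matrix (standing in for one thread block) produces a private partial count, and only the partial counts are combined at the end. The function name `count_ones` and the `block` parameter are illustrative. On a GPU the second stage could be a single atomic add per block or a library reduction, rather than one atomic per element.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Counts entries equal to 1.0f in two stages. Requires block > 0.
std::size_t count_ones(const std::vector<float>& matrix, std::size_t block) {
    std::vector<std::size_t> partial;
    // Stage 1: every chunk of `block` elements gets its own private counter.
    for (std::size_t start = 0; start < matrix.size(); start += block) {
        const std::size_t end = std::min(start + block, matrix.size());
        partial.push_back(static_cast<std::size_t>(
            std::count(matrix.begin() + start, matrix.begin() + end, 1.0f)));
    }
    // Stage 2: combine the per-chunk counts.
    return std::accumulate(partial.begin(), partial.end(), std::size_t{0});
}
```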