I have a matrix (relatively big) that I need to transpose. For example assume that my matrix is
a b c d e f
g h i j k l
m n o p q r
I want the result to be as follows:
a g m
b h n
c i o
d j p
e k q
f l r
What is the fastest way to do this?
This is a good question. There are many reasons you would want to actually transpose the matrix in memory rather than just swap coordinates, e.g. in matrix multiplication and Gaussian smearing.
First let me list one of the functions I use for the transpose (EDIT: please see the end of my answer where I found a much faster solution)
void transpose(float *src, float *dst, const int N, const int M) {
    #pragma omp parallel for
    for(int n = 0; n<N*M; n++) {
        int i = n/N;
        int j = n%N;
        dst[n] = src[M*j + i];
    }
}
Now let's see why the transpose is useful. Consider matrix multiplication C = A*B. We could do it this way.
for(int i=0; i<N; i++) {
    for(int j=0; j<K; j++) {
        float tmp = 0;
        for(int l=0; l<M; l++) {
            tmp += A[M*i+l]*B[K*l+j];
        }
        C[K*i + j] = tmp;
    }
}
That way, however, is going to have a lot of cache misses. A much faster solution is to take the transpose of B first
// B holds its transpose here, stored row-major as K x M, so its row stride is M
for(int i=0; i<N; i++) {
    for(int j=0; j<K; j++) {
        float tmp = 0;
        for(int l=0; l<M; l++) {
            tmp += A[M*i+l]*B[M*j+l];
        }
        C[K*i + j] = tmp;
    }
}
Matrix multiplication is O(n^3) and the transpose is O(n^2), so taking the transpose should have a negligible effect on the computation time (for large n). In matrix multiplication loop tiling is even more effective than taking the transpose but that's much more complicated.
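To make the comparison concrete, here is a minimal sketch of both variants side by side. The function names (`transposed`, `matmul_naive`, `matmul_transposed`) are made up for this illustration, and it uses `std::vector` rather than raw pointers and leaves out OpenMP, so it is not tuned code.

```cpp
#include <cassert>
#include <vector>

// Out-of-place transpose of a row-major rows x cols matrix.
std::vector<float> transposed(const std::vector<float>& src, int rows, int cols) {
    assert(static_cast<int>(src.size()) == rows * cols);
    std::vector<float> dst(src.size());
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            dst[c * rows + r] = src[r * cols + c];
    return dst;
}

// C (N x K) = A (N x M) * B (M x K); the inner loop strides through B by K.
std::vector<float> matmul_naive(const std::vector<float>& A,
                                const std::vector<float>& B, int N, int M, int K) {
    std::vector<float> C(N * K, 0.0f);
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < K; ++j) {
            float tmp = 0.0f;
            for (int l = 0; l < M; ++l) tmp += A[M * i + l] * B[K * l + j];
            C[K * i + j] = tmp;
        }
    return C;
}

// Same product, but B is transposed first so both operands are read contiguously.
std::vector<float> matmul_transposed(const std::vector<float>& A,
                                     const std::vector<float>& B, int N, int M, int K) {
    const std::vector<float> Bt = transposed(B, M, K);  // K x M, row stride M
    std::vector<float> C(N * K, 0.0f);
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < K; ++j) {
            float tmp = 0.0f;
            for (int l = 0; l < M; ++l) tmp += A[M * i + l] * Bt[M * j + l];
            C[K * i + j] = tmp;
        }
    return C;
}
```

Both functions add the products in the same order, so they should return identical results. Only the memory access pattern on B differs.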
I wish I knew a faster way to do the transpose (Edit: I found a faster solution, see the end of my answer). When Haswell/AVX2 comes out in a few weeks it will have a gather function. I don't know if that will be helpful in this case but I could imagine gathering a column and writing out a row. Maybe it will make the transpose unnecessary.
For Gaussian smearing what you do is smear horizontally and then smear vertically. But smearing vertically has the cache problem so what you do is
Smear image horizontally
transpose output
Smear output horizontally
transpose output
Here is a paper by Intel explaining that
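The four steps above can be sketched roughly like this. To keep it short, a 3-tap box filter with clamped edges stands in for the Gaussian kernel, and the names `smear_rows`, `transpose_copy` and `smear_2d` are invented for the illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// 3-tap box filter along each row, edges clamped (stand-in for a Gaussian kernel).
// img is row-major, rows x cols.
std::vector<float> smear_rows(const std::vector<float>& img, int rows, int cols) {
    assert(static_cast<int>(img.size()) == rows * cols);
    std::vector<float> out(img.size());
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            const int lo = std::max(c - 1, 0);
            const int hi = std::min(c + 1, cols - 1);
            out[r * cols + c] =
                (img[r * cols + lo] + img[r * cols + c] + img[r * cols + hi]) / 3.0f;
        }
    }
    return out;
}

// Out-of-place transpose of a row-major rows x cols matrix.
std::vector<float> transpose_copy(const std::vector<float>& src, int rows, int cols) {
    std::vector<float> dst(src.size());
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            dst[c * rows + r] = src[r * cols + c];
    return dst;
}

// Smear horizontally, transpose, smear horizontally again (this now acts on the
// original columns), then transpose back to the original layout.
std::vector<float> smear_2d(const std::vector<float>& img, int rows, int cols) {
    const std::vector<float> h  = smear_rows(img, rows, cols);
    const std::vector<float> ht = transpose_copy(h, rows, cols);  // cols x rows
    const std::vector<float> v  = smear_rows(ht, cols, rows);
    return transpose_copy(v, cols, rows);                         // rows x cols
}
```

The point of the design is that both filter passes read memory contiguously; the strided accesses are confined to the two transposes.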
Lastly, what I actually do in matrix multiplication (and in Gaussian smearing) is not take exactly the transpose but take the transpose in widths of a certain vector size (e.g. 4 or 8 for SSE/AVX). Here is the function I use
void reorder_matrix(const float* A, float* B, const int N, const int M, const int vec_size) {
    #pragma omp parallel for
    for(int n=0; n<M*N; n++) {
        int k = vec_size*(n/N/vec_size);
        int i = (n/vec_size)%N;
        int j = n%vec_size;
        B[n] = A[M*i + k + j];
    }
}
I tried several functions to find the fastest transpose for large matrices. In the end the fastest result is to use loop blocking with block_size=16 (Edit: I found a faster solution using SSE and loop blocking - see below). This code works for any NxM matrix (i.e. the matrix does not have to be square).
inline void transpose_scalar_block(float *A, float *B, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<block_size; i++) {
        for(int j=0; j<block_size; j++) {
            B[j*ldb + i] = A[i*lda +j];
        }
    }
}
inline void transpose_block(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<n; i+=block_size) {
        for(int j=0; j<m; j+=block_size) {
            transpose_scalar_block(&A[i*lda +j], &B[j*ldb + i], lda, ldb, block_size);
        }
    }
}
The values lda and ldb are the width of the matrix. These need to be multiples of the block size. To find the values and allocate the memory for e.g. a 3000x1001 matrix I do something like this
#define ROUND_UP(x, s) (((x)+((s)-1)) & -(s))
const int n = 3000;
const int m = 1001;
int lda = ROUND_UP(m, 16);
int ldb = ROUND_UP(n, 16);
float *A = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);
float *B = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);
For 3000x1001 this returns ldb = 3008 and lda = 1008
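As a quick check of that arithmetic, here is the same rounding written as a `constexpr` function (the name `round_up` is mine; the macro above is what the answer actually uses). It relies on `s` being a power of two, because `-s` is then a mask with the low bits cleared.

```cpp
#include <cassert>

// Round x up to the next multiple of s. Only valid when s is a power of two.
constexpr int round_up(int x, int s) {
    return (x + (s - 1)) & -s;
}

// Padded leading dimension for a row of `width` floats, using 16-float blocks.
int padded_width(int width) {
    assert(width >= 0);
    return round_up(width, 16);
}
```

For the 3000x1001 example, `padded_width(1001)` gives lda and `padded_width(3000)` gives ldb.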
I found an even faster solution using SSE intrinsics:
inline void transpose4x4_SSE(float *A, float *B, const int lda, const int ldb) {
    __m128 row1 = _mm_load_ps(&A[0*lda]);
    __m128 row2 = _mm_load_ps(&A[1*lda]);
    __m128 row3 = _mm_load_ps(&A[2*lda]);
    __m128 row4 = _mm_load_ps(&A[3*lda]);
    _MM_TRANSPOSE4_PS(row1, row2, row3, row4);
    _mm_store_ps(&B[0*ldb], row1);
    _mm_store_ps(&B[1*ldb], row2);
    _mm_store_ps(&B[2*ldb], row3);
    _mm_store_ps(&B[3*ldb], row4);
}
inline void transpose_block_SSE4x4(float *A, float *B, const int n, const int m, const int lda, const int ldb ,const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<n; i+=block_size) {
        for(int j=0; j<m; j+=block_size) {
            int max_i2 = i+block_size < n ? i + block_size : n;
            int max_j2 = j+block_size < m ? j + block_size : m;
            for(int i2=i; i2<max_i2; i2+=4) {
                for(int j2=j; j2<max_j2; j2+=4) {
                    transpose4x4_SSE(&A[i2*lda +j2], &B[j2*ldb + i2], lda, ldb);
                }
            }
        }
    }
}
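For readers without SSE at hand, the same blocking idea can be sketched in plain scalar C++. This is only an illustration of the tiling and the edge clamping: it has no intrinsics and no OpenMP, and the names `transpose_blocked` and `transpose_blocked_copy` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Transposes a row-major n x m matrix A (row stride lda) into B (m x n, row
// stride ldb), one block x block tile at a time. Partial tiles at the right and
// bottom edges are clamped, so n and m need not be multiples of block.
void transpose_blocked(const float* A, float* B, int n, int m,
                       int lda, int ldb, int block) {
    assert(block > 0 && lda >= m && ldb >= n);
    for (int i = 0; i < n; i += block) {
        for (int j = 0; j < m; j += block) {
            const int max_i = std::min(i + block, n);
            const int max_j = std::min(j + block, m);
            for (int i2 = i; i2 < max_i; ++i2)
                for (int j2 = j; j2 < max_j; ++j2)
                    B[j2 * ldb + i2] = A[i2 * lda + j2];
        }
    }
}

// Convenience wrapper for unpadded storage (lda == m, ldb == n).
std::vector<float> transpose_blocked_copy(const std::vector<float>& A,
                                          int n, int m, int block = 16) {
    assert(static_cast<int>(A.size()) == n * m);
    std::vector<float> B(A.size());
    transpose_blocked(A.data(), B.data(), n, m, m, n, block);
    return B;
}
```

Because the scalar inner loops touch one element at a time, this version does not need the padding to a multiple of the block size that the SSE version relies on.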
This is going to depend on your application but in general the fastest way to transpose a matrix would be to invert your coordinates when you do a look up, then you do not have to actually move any data.
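A minimal sketch of that idea, assuming row-major storage and a non-owning view (the class name `MatrixView` and its members are made up for this example):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Non-owning view over a row-major matrix. t() returns a transposed view that
// shares the same storage: only the index calculation changes, no data moves.
class MatrixView {
public:
    MatrixView(const float* data, std::size_t rows, std::size_t cols)
        : data_(data), rows_(rows), cols_(cols), stride_(cols), transposed_(false) {}

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

    float at(std::size_t i, std::size_t j) const {
        assert(i < rows_ && j < cols_);
        // When transposed, logical (i, j) maps to stored element (j, i).
        return transposed_ ? data_[j * stride_ + i] : data_[i * stride_ + j];
    }

    MatrixView t() const {
        MatrixView v = *this;
        std::swap(v.rows_, v.cols_);
        v.transposed_ = !transposed_;
        return v;
    }

private:
    const float* data_;
    std::size_t rows_, cols_;
    std::size_t stride_;  // row length of the underlying storage
    bool transposed_;
};
```

The trade-off is that a transposed view is read with a stride, so code that walks its rows can suffer the cache misses discussed in the other answers.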
Some details about transposing 4x4 square float (I will discuss 32-bit integer later) matrices with x86 hardware. It's helpful to start here in order to transpose larger square matrices such as 8x8 or 16x16.
_MM_TRANSPOSE4_PS(r0, r1, r2, r3) is implemented differently by different compilers. GCC and ICC (I have not checked Clang) use unpcklps, unpckhps, unpcklpd, unpckhpd whereas MSVC uses only shufps. We can actually combine these two approaches together like this.
t0 = _mm_unpacklo_ps(r0, r1);
t1 = _mm_unpackhi_ps(r0, r1);
t2 = _mm_unpacklo_ps(r2, r3);
t3 = _mm_unpackhi_ps(r2, r3);
r0 = _mm_shuffle_ps(t0,t2, 0x44);
r1 = _mm_shuffle_ps(t0,t2, 0xEE);
r2 = _mm_shuffle_ps(t1,t3, 0x44);
r3 = _mm_shuffle_ps(t1,t3, 0xEE);
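Wrapped into a function, the combined unpack/shuffle sequence above might look like the following sketch. It assumes an x86 target with SSE, uses unaligned loads and stores so the buffer needs no special alignment, and the function name `transpose4x4` is mine.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics, x86 only

// In-place transpose of a 4x4 row-major float matrix: 4 unpacks + 4 shuffles.
void transpose4x4(float* m) {
    assert(m != nullptr);
    const __m128 r0 = _mm_loadu_ps(m + 0);
    const __m128 r1 = _mm_loadu_ps(m + 4);
    const __m128 r2 = _mm_loadu_ps(m + 8);
    const __m128 r3 = _mm_loadu_ps(m + 12);

    // Interleave pairs of rows.
    const __m128 t0 = _mm_unpacklo_ps(r0, r1);  // r0[0] r1[0] r0[1] r1[1]
    const __m128 t1 = _mm_unpackhi_ps(r0, r1);  // r0[2] r1[2] r0[3] r1[3]
    const __m128 t2 = _mm_unpacklo_ps(r2, r3);  // r2[0] r3[0] r2[1] r3[1]
    const __m128 t3 = _mm_unpackhi_ps(r2, r3);  // r2[2] r3[2] r2[3] r3[3]

    // 0x44 takes the low halves of both operands, 0xEE the high halves.
    _mm_storeu_ps(m + 0,  _mm_shuffle_ps(t0, t2, 0x44));  // column 0
    _mm_storeu_ps(m + 4,  _mm_shuffle_ps(t0, t2, 0xEE));  // column 1
    _mm_storeu_ps(m + 8,  _mm_shuffle_ps(t1, t3, 0x44));  // column 2
    _mm_storeu_ps(m + 12, _mm_shuffle_ps(t1, t3, 0xEE));  // column 3
}
```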
One interesting observation is that two shuffles can be converted to one shuffle and two blends (SSE4.1) like this.
t0 = _mm_unpacklo_ps(r0, r1);
t1 = _mm_unpackhi_ps(r0, r1);
t2 = _mm_unpacklo_ps(r2, r3);
t3 = _mm_unpackhi_ps(r2, r3);
v = _mm_shuffle_ps(t0,t2, 0x4E);
r0 = _mm_blend_ps(t0,v, 0xC);
r1 = _mm_blend_ps(t2,v, 0x3);
v = _mm_shuffle_ps(t1,t3, 0x4E);
r2 = _mm_blend_ps(t1,v, 0xC);
r3 = _mm_blend_ps(t3,v, 0x3);
This effectively converted 4 shuffles into 2 shuffles and 4 blends. This uses 2 more instructions than the implementation of GCC, ICC, and MSVC. The advantage is that it reduces port pressure which may have a benefit in some circumstances.
Currently all the shuffles and unpacks can go only to one particular port whereas the blends can go to either of two different ports.
I tried using 8 shuffles like MSVC and converting that into 4 shuffles + 8 blends but it did not work. I still had to use 4 unpacks.
I used this same technique for an 8x8 float transpose (see towards the end of that answer). In that answer I still had to use 8 unpacks but I managed to convert the 8 shuffles into 4 shuffles and 8 blends.
For 32-bit integers there is nothing like shufps (except for 128-bit shuffles with AVX512) so it can only be implemented with unpacks which I don't think can be converted to blends (efficiently). With AVX512 vshufi32x4 acts effectively like shufps except for 128-bit lanes of 4 integers instead of 32-bit floats so this same technique might be possible with vshufi32x4 in some cases. With Knights Landing shuffles are four times slower (throughput) than blends.
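If the sizes of the arrays are known beforehand, then we could use a union to help. Like this:
#include <bits/stdc++.h>
using namespace std;
union ua{
    int arr[2][3];
    int brr[3][2];
};
int main() {
    union ua uav;
    int karr[2][3] = {{1,2,3},{4,5,6}};
    memcpy(uav.arr, karr, sizeof(karr)); // fill the union; otherwise brr is read uninitialized
    // Caution: brr views the same six ints as 3x2 (1 2 / 3 4 / 5 6), which is a reshape
    // rather than a transpose, and reading the inactive member is undefined behaviour in C++.
    for (int i=0;i<3;i++) {
        for (int j=0;j<2;j++)
            cout<<uav.brr[i][j]<<" ";
        cout<<'\n';
    }
    return 0;
}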
Consider each row as a column, and each column as a row .. use j,i instead of i,j
#include <iostream>
using namespace std;
int main ()
{
    char A [3][3] =
    {
        { 'a', 'b', 'c' },
        { 'd', 'e', 'f' },
        { 'g', 'h', 'i' }
    };
    cout << "A = " << endl << endl;
    // print matrix A
    for (int i=0; i<3; i++)
    {
        for (int j=0; j<3; j++) cout << A[i][j];
        cout << endl;
    }
    cout << endl << "A transpose = " << endl << endl;
    // print A transpose
    for (int i=0; i<3; i++)
    {
        for (int j=0; j<3; j++) cout << A[j][i];
        cout << endl;
    }
    return 0;
}
transposing without any overhead (class not complete):
class Matrix{
    double *data; //suppose this will point to data
    double _get1(int i, int j){return data[i*M+j];} //used to access normally
    double _get2(int i, int j){return data[j*M+i];} //used when transposed: stored element (j,i)
    int M, N; //dimensions
    double (Matrix::*get_p)(int, int); //pointer to the member function used to access elements
public:
    Matrix(int _M,int _N):M(_M), N(_N){
        //allocate data
        get_p=&Matrix::_get1; // initialised with normal access
    }
    double get(int i, int j){
        //there should be a way to directly use get_p to call. but i think even this
        //doesnt incur overhead because it is inline and the compiler should be intelligent
        //enough to remove the extra call
        return (this->*get_p)(i,j);
    }
    void transpose(){ //twice transpose gives the original
        if(get_p==&Matrix::_get1) get_p=&Matrix::_get2;
        else get_p=&Matrix::_get1;
    }
};
can be used like this:
Matrix M(100,200);
double x=M.get(17,45);
M.transpose();
x=M.get(17,45); // = original M(45,17)
Of course, I didn't bother with memory management here, which is crucial but a different topic.
#include <vector>
template <class T>
void transpose( const std::vector< std::vector<T> > & a,
std::vector< std::vector<T> > & b,
int width, int height)
{
    for (int i = 0; i < width; i++)
    {
        for (int j = 0; j < height; j++)
        {
            b[j][i] = a[i][j];
        }
    }
}
Modern linear algebra libraries include optimized versions of the most common operations. Many of them include dynamic CPU dispatch, which chooses the best implementation for the hardware at program execution time (without compromising on portability).
This is commonly a better alternative to manually optimizing your functions via vector extension intrinsics. The latter will tie your implementation to a particular hardware vendor and model: if you decide to switch to a different vendor (e.g. Power, ARM) or to newer vector extensions (e.g. AVX512), you will need to re-implement it to get the most out of them.
MKL transposition, for example, includes the BLAS extensions function imatcopy. You can find it in other implementations such as OpenBLAS as well:
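#include <mkl.h>
void transpose( float* a, int n, int m ) {
    const char row_major = 'R';
    const char transpose = 'T';
    const float alpha = 1.0f;
    // a is n x m in row-major order: the source leading dimension is m, the destination (m x n) one is n
    mkl_simatcopy (row_major, transpose, n, m, alpha, a, m, n);
}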
For a C++ project, you can make use of the Armadillo C++ library:
#include <armadillo>
void transpose( arma::mat &matrix ) {
    arma::inplace_trans(matrix);
}
Intel MKL provides both in-place and out-of-place routines for transposing/copying matrices. Here is the link to the documentation. I would recommend trying the out-of-place implementation, as it is faster than the in-place one. Also note that the documentation of the latest version of MKL contains some mistakes.
I think the fastest way should take no more than O(n^2) time, and done this way it needs only O(1) extra space:
The way to do that is to swap in pairs (this works in place for a square matrix), because when you transpose a matrix what you do is M[i][j]=M[j][i]. So store M[i][j] in temp, then set M[i][j]=M[j][i], and as the last step set M[j][i]=temp. This can be done in one pass, so it takes O(n^2).
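Here is a short sketch of that pairwise swap, assuming a square matrix stored row-major in a `std::vector` (the function name `transpose_in_place` is just for this example). Note that the inner loop starts at `i + 1`: looping over all `(i, j)` pairs would swap every pair twice and undo the transpose.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// In-place transpose of a square n x n row-major matrix.
// O(n^2) time, O(1) extra space.
void transpose_in_place(std::vector<int>& m, std::size_t n) {
    assert(m.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        // Visit only the upper triangle so each (i, j) / (j, i) pair is swapped once.
        for (std::size_t j = i + 1; j < n; ++j) {
            std::swap(m[i * n + j], m[j * n + i]);
        }
    }
}
```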
My answer transposes a 3x3 matrix:
#include <iostream>
using namespace std;
int main()
{
    int a[3][3];
    cout<<"You must give us a 3x3 array and then we will give you its transpose"<<endl;
    for(int i=0;i<3;i++)
    {
        for(int j=0;j<3;j++)
        {
            cout<<"Enter a["<<i<<"]["<<j<<"]: ";
            cin>>a[i][j];
        }
    }
    cout<<"Matrix you entered is :"<<endl;
    for (int e = 0 ; e < 3 ; e++ )
    {
        for ( int f = 0 ; f < 3 ; f++ )
            cout << a[e][f] << "\t";
        cout << endl;
    }
    cout<<"\nTranspose of the matrix you entered is :"<<endl;
    for (int c = 0 ; c < 3 ; c++ )
    {
        for ( int d = 0 ; d < 3 ; d++ )
            cout << a[d][c] << "\t";
        cout << endl;
    }
    return 0;
}
I'm trying to use cudaMemcpy to copy from a std::vector::data to an array for a device kernel, and it gives a seg fault error. The way I do it is:
cudaMemcpy(d_x, vx.data(), N*sizeof(float), cudaMemcpyHostToDevice);
where vx is a vector. The following is the complete example. Any hints on where the problem is would be appreciated.
#include <iostream>
#include <math.h>
#include <vector>
using namespace std;
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i < n) {
        y[i] = x[i] + y[i];
    }
}
int main(void)
{
    int N = 1<<10;
    float *d_x = NULL, *d_y = NULL;
    cudaMalloc((void **)&d_x, sizeof(float)*N);
    cudaMalloc((void **)&d_y, sizeof(float)*N);
    // Allocate Unified Memory – accessible from CPU or GPU
    vector<float> vx;
    vector<float> vy;
    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        vx.push_back(1.0f);
        vy.push_back(2.0f);
    }
    cudaMemcpy(d_x, vx.data(), N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, vy.data(), N*sizeof(float), cudaMemcpyHostToDevice);
    int blockSize; // The launch configurator returned block size
    int minGridSize; // The minimum grid size needed to achieve the
    // maximum occupancy for a full device launch
    int gridSize; // The actual grid size needed, based on input size
    cudaOccupancyMaxPotentialBlockSize( &minGridSize, &blockSize, add, 0, N);
    // Round up according to array size
    gridSize = (N + blockSize - 1) / blockSize;
    cout<<"blockSize: "<<blockSize<<" minGridSize: "<<minGridSize<<" gridSize: "<<gridSize<<endl;
    // calculate theoretical occupancy
    int maxActiveBlocks;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor( &maxActiveBlocks, add, blockSize, 0);
    int device;
    cudaDeviceProp props;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&props, device);
    float occupancy = (maxActiveBlocks * blockSize / props.warpSize) /
                      (float)(props.maxThreadsPerMultiProcessor /
                              props.warpSize);
    printf("Launched blocks of size %d. Theoretical occupancy: %f\n",
           blockSize, occupancy);
    // Run kernel on 1M elements on the GPU
    add<<<gridSize, blockSize>>>(N, d_x, d_y);
    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++) {
        maxError = fmax(maxError, fabs(d_y[i]-3.0f));
    }
    std::cout << "Max error: " << maxError << std::endl;
    // Free memory
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
blockSize: 1024 minGridSize: 16 gridSize: 1
Launched blocks of size 1024. Theoretical occupancy: 1.000000
Segmentation fault (core dumped)
The problem is here:
for (int i = 0; i < N; i++) {
    maxError = fmax(maxError, fabs(d_y[i]-3.0f));
}
The reason is that you cannot dereference a device pointer on the host.
The solution is to copy the device memory back to the host, similar to what you did for the host-to-device copy.
My Question:
I am looking for someone to either point out a mistake in the way I am attempting to implement zero-copy in CUDA, or reveal a more 'behind the scenes' perspective on why the zero-copy method would not be faster than the memcpy method. By the way, I am performing my tests on NVIDIA's TK1 processor, using Ubuntu.
My problem has to do with efficiently using the NVIDIA TK1's (physically) unified memory architecture with CUDA. There are 2 methods NVIDIA provides for GPU/CPU memory transfer abstraction.
1. Unified Memory abstraction (using cudaHostAlloc & cudaHostGetDevicePointer)
2. Explicit copy to host, and from device (using cudaMalloc() & cudaMemcpy)
Short description of my test code: I test the same CUDA kernel using both methods 1 and 2. I expected method 1 to be faster, given that there is no copy of the source data to the device and no copy of the result data from the device. However, the results are the opposite of what I assumed (method 1 is 50% slower). Below is my code for this test:
#include <libfreenect/libfreenect.hpp>
#include <iostream>
#include <vector>
#include <cmath>
#include <pthread.h>
#include <cxcore.h>
#include <time.h>
#include <sys/time.h>
#include <memory.h>
#include <cuda.h>
#include <cuda_runtime.h>
///OpenCV 2.4
#include <highgui.h>
#include <cv.h>
#include <opencv2/gpu/gpu.hpp>
using namespace cv;
using namespace std;
///The Test Kernel///
__global__ void cudaCalcXYZ( float *dst, float *src, float *M, int height, int width, float scaleFactor, int minDistance)
{
    float nx,ny,nz, nzpminD, jFactor;
    int heightCenter = height / 2;
    int widthCenter = width / 2;
    //int j = blockIdx.x; //Represents which row we are in
    int index = blockIdx.x*width;
    jFactor = (blockIdx.x - heightCenter)*scaleFactor;
    for(int i= 0; i < width; i++)
    {
        nz = src[index];
        nzpminD = nz + minDistance;
        nx = (i - widthCenter )*(nzpminD)*scaleFactor;
        ny = (jFactor)*(nzpminD);
        //Solve for only Y matrix (height values)
        dst[index++] = nx*M[4] + ny*M[5] + nz*M[6];
        //dst[index++] = 1 + 2 + 3;
    }
}
//Function fwd declarations
double getMillis();
double getMicros();
void runCudaTestZeroCopy(int iter, int cols, int rows);
void runCudaTestDeviceCopy(int iter, int cols, int rows);
int main(int argc, char **argv) {
    //ZERO COPY FLAG (allows runCudaTestZeroCopy to run without fail)
    cudaSetDeviceFlags(cudaDeviceMapHost);
    //Runs kernel using explicit data copy to 'device' and back from 'device'
    runCudaTestDeviceCopy(20, 640,480);
    //Uses 'unified memory' cuda abstraction so device can directly work from host data
    runCudaTestZeroCopy(20,640, 480);
    std::cout << "Stopping test" << std::endl;
    return 0;
}
void runCudaTestZeroCopy(int iter, int cols, int rows)
{
    cout << "CUDA Test::ZEROCOPY" << endl;
    int src_rows = rows;
    int src_cols = cols;
    int m_rows = 4;
    int m_cols = 4;
    int dst_rows = src_rows;
    int dst_cols = src_cols;
    //Create and allocate memory for host mats pointers
    float *psrcMat;
    float *pmMat;
    float *pdstMat;
    cudaHostAlloc((void **)&psrcMat, src_rows*src_cols*sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void **)&pmMat, m_rows*m_cols*sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void **)&pdstMat, dst_rows*dst_cols*sizeof(float), cudaHostAllocMapped);
    //Create mats using host pointers
    Mat src_mat = Mat(cvSize(src_cols, src_rows), CV_32FC1, psrcMat);
    Mat m_mat = Mat(cvSize(m_cols, m_rows), CV_32FC1, pmMat);
    Mat dst_mat = Mat(cvSize(dst_cols, dst_rows), CV_32FC1, pdstMat);
    //configure src and m mats
    for(int i = 0; i < src_rows*src_cols; i++)
    {
        psrcMat[i] = (float)i;
    }
    for(int i = 0; i < m_rows*m_cols; i++)
    {
        pmMat[i] = 0.1234;
    }
    //Create pointers to dev mats
    float *d_psrcMat;
    float *d_pmMat;
    float *d_pdstMat;
    //Map device to host pointers
    cudaHostGetDevicePointer((void **)&d_psrcMat, (void *)psrcMat, 0);
    //cudaHostGetDevicePointer((void **)&d_pmMat, (void *)pmMat, 0);
    cudaHostGetDevicePointer((void **)&d_pdstMat, (void *)pdstMat, 0);
    //Copy matrix M to device
    cudaMalloc( (void **)&d_pmMat, sizeof(float)*4*4 ); //4x4 matrix
    cudaMemcpy( d_pmMat, pmMat, sizeof(float)*m_rows*m_cols, cudaMemcpyHostToDevice);
    //Additional Variables for kernels
    float scaleFactor = 0.0021;
    int minDistance = -10;
    //Run kernel! //cudaSimpleMult( float *dst, float *src, float *M, int width, int height)
    int blocks = src_rows;
    const int numTests = iter;
    double perfStart = getMillis();
    for(int i = 0; i < numTests; i++)
    {
        //cudaSimpleMult<<<blocks,1>>>(d_pdstMat, d_psrcMat, d_pmMat, src_cols, src_rows);
        cudaCalcXYZ<<<blocks,1>>>(d_pdstMat, d_psrcMat, d_pmMat, src_rows, src_cols, scaleFactor, minDistance);
    }
    double perfStop = getMillis();
    double perfDelta = perfStop - perfStart;
    cout << "Ran " << numTests << " iterations totaling " << perfDelta << "ms" << endl;
    cout << " Average time per iteration: " << (perfDelta/(float)numTests) << "ms" << endl;
    //Copy result back to host
    //cudaMemcpy(pdstMat, d_pdstMat, sizeof(float)*src_rows*src_cols, cudaMemcpyDeviceToHost);
    //cout << "Printing results" << endl;
    //for(int i = 0; i < 16*16; i++)
    // cout << "src[" << i << "]= " << psrcMat[i] << " dst[" << i << "]= " << pdstMat[i] << endl;
}
void runCudaTestDeviceCopy(int iter, int cols, int rows)
{
    cout << "CUDA Test::DEVICE COPY" << endl;
    int src_rows = rows;
    int src_cols = cols;
    int m_rows = 4;
    int m_cols = 4;
    int dst_rows = src_rows;
    int dst_cols = src_cols;
    //Create and allocate memory for host mats pointers
    float *psrcMat;
    float *pmMat;
    float *pdstMat;
    cudaHostAlloc((void **)&psrcMat, src_rows*src_cols*sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void **)&pmMat, m_rows*m_cols*sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void **)&pdstMat, dst_rows*dst_cols*sizeof(float), cudaHostAllocMapped);
    //Create pointers to dev mats
    float *d_psrcMat;
    float *d_pmMat;
    float *d_pdstMat;
    cudaMalloc( (void **)&d_psrcMat, sizeof(float)*src_rows*src_cols );
    cudaMalloc( (void **)&d_pdstMat, sizeof(float)*src_rows*src_cols );
    cudaMalloc( (void **)&d_pmMat, sizeof(float)*4*4 ); //4x4 matrix
    //Create mats using host pointers
    Mat src_mat = Mat(cvSize(src_cols, src_rows), CV_32FC1, psrcMat);
    Mat m_mat = Mat(cvSize(m_cols, m_rows), CV_32FC1, pmMat);
    Mat dst_mat = Mat(cvSize(dst_cols, dst_rows), CV_32FC1, pdstMat);
    //configure src and m mats
    for(int i = 0; i < src_rows*src_cols; i++)
    {
        psrcMat[i] = (float)i;
    }
    for(int i = 0; i < m_rows*m_cols; i++)
    {
        pmMat[i] = 0.1234;
    }
    //Additional Variables for kernels
    float scaleFactor = 0.0021;
    int minDistance = -10;
    //Run kernel! //cudaSimpleMult( float *dst, float *src, float *M, int width, int height)
    int blocks = src_rows;
    double perfStart = getMillis();
    for(int i = 0; i < iter; i++)
    {
        //Copy from host to device
        cudaMemcpy( d_psrcMat, psrcMat, sizeof(float)*src_rows*src_cols, cudaMemcpyHostToDevice);
        cudaMemcpy( d_pmMat, pmMat, sizeof(float)*m_rows*m_cols, cudaMemcpyHostToDevice);
        //Run Kernel
        //cudaSimpleMult<<<blocks,1>>>(d_pdstMat, d_psrcMat, d_pmMat, src_cols, src_rows);
        cudaCalcXYZ<<<blocks,1>>>(d_pdstMat, d_psrcMat, d_pmMat, src_rows, src_cols, scaleFactor, minDistance);
        //Copy from device to host
        cudaMemcpy( pdstMat, d_pdstMat, sizeof(float)*src_rows*src_cols, cudaMemcpyDeviceToHost);
    }
    double perfStop = getMillis();
    double perfDelta = perfStop - perfStart;
    cout << "Ran " << iter << " iterations totaling " << perfDelta << "ms" << endl;
    cout << " Average time per iteration: " << (perfDelta/(float)iter) << "ms" << endl;
}
//Timing functions for performance measurements
double getMicros()
{
    timespec ts;
    //double t_ns, t_s;
    long t_ns;
    double t_s;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    t_s = (double)ts.tv_sec;
    t_ns = ts.tv_nsec;
    //return( (t_s *1000.0 * 1000.0) + (double)(t_ns / 1000.0) );
    return ((double)t_ns / 1000.0);
}
double getMillis()
{
    timespec ts;
    double t_ns, t_s;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    t_s = (double)ts.tv_sec;
    t_ns = (double)ts.tv_nsec;
    return( (t_s * 1000.0) + (t_ns / 1000000.0) );
}
I have already seen the post Cuda zero-copy performance, but I feel this was not related for the following reason: The GPU and CPUs have a physically unified memory architecture.
When you are using ZeroCopy, the read to memory goes through some path where it queries the memory unit to fetch data from system memory. This operation has some latency.
When using direct access to memory, the memory unit gathers data from global memory, and has a different access pattern and latency.
Actually seeing this difference would require some form of profiling.
Nonetheless, your kernel launch uses a single thread per block:
cudaCalcXYZ<<< blocks,1 >>> (...
In this case, the GPU has little opportunity to hide latency when memory is fetched from system memory (or global memory). I would recommend you use more threads (some multiple of 64, at least 128 total), and run the profiler on it to get the cost of memory access. Your algorithm seems separable, and modifying the code from
for(int i= 0; i < width; i++)
to
for (int i = threadIdx.x ; i < width ; i += blockDim.x)
will probably increase performance overall.
Image size is 640 in width which will turn into 5 iterations of 128 threads.
cudaCalcXYZ<<< blocks,128 >>> (...
I believe it would result in some performance increase.
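To see how the strided loop divides the row among threads, here is a small host-side simulation in plain C++ (no CUDA needed). The variables `tid` and `block_dim` stand in for `threadIdx.x` and `blockDim.x`, and the function names are invented for the illustration.

```cpp
#include <cassert>
#include <vector>

// Simulates `for (i = threadIdx.x; i < width; i += blockDim.x)` for every
// thread of one block and counts how often each column index is visited.
std::vector<int> visit_counts(int width, int block_dim) {
    assert(width >= 0 && block_dim > 0);
    std::vector<int> counts(width, 0);
    for (int tid = 0; tid < block_dim; ++tid) {          // stands in for threadIdx.x
        for (int i = tid; i < width; i += block_dim) {   // the strided loop
            ++counts[i];
        }
    }
    return counts;
}

// Number of loop iterations a single simulated thread executes.
int iterations_for_thread(int width, int block_dim, int tid) {
    assert(block_dim > 0 && tid >= 0 && tid < block_dim);
    int n = 0;
    for (int i = tid; i < width; i += block_dim) ++n;
    return n;
}
```

With a width of 640 and 128 threads, every column is visited exactly once and each thread runs 5 iterations, matching the figure above. One thing to watch in the real kernel: the running `index++` would presumably have to be replaced by an index computed from `i` (for example `blockIdx.x*width + i`), since one thread's consecutive iterations are no longer adjacent in memory.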
The zero-copy feature allows a kernel to work on data without manually copying it to device memory with a function like cudaMemcpy. Zero-copy memory only passes the host address to the device, and the kernel reads from and writes to that address directly. So the more thread blocks you launch, the more reads and writes the kernel issues and the more host addresses are passed to the device. As a result, you get a better performance gain than if you launch only a few thread blocks.
I am writing a CUDA application for Jetson TK1 using CUDA 6. I have got the impression from Mark Harris in his blog post
Jetson TK1: Mobile Embedded Supercomputer Takes CUDA Everywhere
that the memory of the Tegra K1 is physically unified. I have also observed results indicating that cudaMallocManaged is significantly faster for global memory than ordinary cudaMemcpy. This is probably because the Unified Memory doesn't require any copying.
However, what do I do when I want to use texture memory for parts of my application? I have not found any support for textures using cudaMallocManaged, so I have assumed that I have to use the normal cudaMemcpyToArray and cudaBindTextureToArray?
The previously mentioned method often seems to work, but the variables managed by cudaMallocManaged sometimes give me weird segmentation faults. Is this the right way to use texture memory along with Unified Memory? The following code illustrates how I do it. This code works fine, but my question is whether this is the right way to go or whether it might create undefined behaviour that could cause e.g. segmentation faults.
#include <iostream>
#define width 16
#define height 16
texture<float, cudaTextureType2D, cudaReadModeElementType> input_tex;
__global__ void some_tex_kernel(float* output){
    int i= threadIdx.x;
    float x = i%width+0.5f;
    float y = i/width+0.5f;
    output[i] = tex2D(input_tex, x, y);
}
int main(){
    float* out;
    if(cudaMallocManaged(&out, width*height*sizeof(float))!= cudaSuccess)
        std::cout << "unified not working\n";
    for(int i=0; i< width*height; ++i){
        out[i] = float(i);
    }
    const cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray* input_t;
    cudaMallocArray(&input_t, &desc, width, height);
    cudaMemcpyToArrayAsync(input_t, 0, 0, out, width*height*sizeof(float), cudaMemcpyHostToDevice);
    input_tex.filterMode = cudaFilterModeLinear;
    cudaBindTextureToArray(input_tex, input_t, desc);
    some_tex_kernel<<<1, width*height>>>(out);
    cudaDeviceSynchronize();
    for(int i=0;i<width*height; ++i)
        std::cout << out[i] << " ";
}
Another thing that I find odd is that if I remove the cudaDeviceSynchronize() in the code I always get segmentation faults. I understand that the result might not be finished if I read it without a synchronization but should not the variable still be accessible?
Anyone have a clue?
The only managed memory possibilities at this time are static allocations using __device__ __managed__ or dynamic allocations using cudaMallocManaged(). There is no direct support for textures, surfaces, constant memory, etc.
Your usage of texturing is fine. The only overlap between texture usage and managed memory is in the following call:
cudaMemcpyToArrayAsync(input_t, 0, 0, out, width*height*sizeof(float), cudaMemcpyHostToDevice);
where managed memory is the source (i.e. host side) of the transfer. This is acceptable as long as the call is issued during a period when no kernels are executing (see below).
"Another thing that I find odd is that if I remove the cudaDeviceSynchronize() in the code I always get segmentation faults."
cudaDeviceSynchronize(); is necessary after a kernel call to make the managed memory visible to the host again. I suggest you read this section of the documentation carefully:
"In general, it is not permitted for the CPU to access any managed allocations or variables while the GPU is active. Concurrent CPU/GPU accesses, ... will cause a segmentation fault..."
As you've indicated, the code you posted works fine. If you have other code that has unpredictable seg faults while using managed memory, I would carefully inspect the code flow (especially if you are using streams i.e. concurrency) to make sure that the host is accessing managed data only after a cudaDeviceSynchronize(); has been issued, and before any subsequent kernel calls.
Robert Crovella has already answered your question. However, in order to show you that cudaMallocManaged can be used in the framework of texture memory, I have dusted off my 1D linear interpolation code and converted it to use cudaMallocManaged. You will see that the code performs the 1D linear interpolation in four different ways:
CPU;
GPU;
GPU using tex1Dfetch;
GPU using tex1D filtering.
The code works without problems in all four cases, including the latter two, on a Kepler K20c card.
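For orientation, the interpolation that all of the variants compute can be sketched on the CPU as follows. This is a simplified illustration, not the code from the listing: the name `lerp_sample` is hypothetical, and it takes a position already expressed in sample units instead of applying the `M/2` shift used below.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Linear interpolation between neighbouring samples.
// data[k] is the sample at integer position k; pos is a fractional position
// with 0 <= pos < data.size() - 1.
float lerp_sample(const std::vector<float>& data, float pos) {
    const int k = static_cast<int>(std::floor(pos));   // left neighbour
    assert(k >= 0 && k + 1 < static_cast<int>(data.size()));
    const float a = pos - static_cast<float>(k);       // fractional part in [0, 1)
    return a * data[k + 1] + (1.0f - a) * data[k];
}
```

The texture-filtering variant delegates this same weighted average to the texture hardware, which is why it needs no explicit arithmetic in the kernel.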
// includes, system
#include <cstdlib>
#include <conio.h>
#include <math.h>
#include <fstream>
#include <iostream>
#include <iomanip>
// includes, cuda
#include <cuda.h>
#include <cuda_runtime.h>
using namespace std;
texture<float, 1, cudaReadModeElementType> data_d_texture_filtering;
texture<float, 1> data_d_texture;
#define BLOCK_SIZE 256
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) { getch(); exit(code); }
    }
}
// --- Generates N equally spaced, increasing points between a and b and stores them in x
void linspace(float* x, float a, float b, int N) {
    float delta_x=(b-a)/(float)N;
    x[0]=a;
    for(int k=1;k<N;k++) x[k]=x[k-1]+delta_x;
}
// --- Generates N randomly spaced, increasing points between a and b and stores them in x
void randspace(float* x, float a, float b, int N) {
    float delta_x=(b-a)/(float)N;
    x[0]=a;
    for(int k=1;k<N;k++) x[k]=x[k-1]+delta_x+(((float)rand()/(float)RAND_MAX-0.5)*(1./(float)N));
}
// --- Generates N random data points ranging in (0.f,1.f)
void Data_Generator(float* data, int N) {
    for(int k=0;k<N;k++) {
        data[k]=(float)rand()/(float)RAND_MAX;
    }
}
float linear_kernel_CPU(float in)
{
    float d_y;
    return 1.-abs(in);
}
void linear_interpolation_function_CPU(float* result_GPU, float* data, float* x_in, float* x_out, int M, int N){
    float a;
    for(int j=0; j<N; j++){
        int k = floor(x_out[j]+M/2);
        a = x_out[j]+M/2-floor(x_out[j]+M/2);
        result_GPU[j] = a * data[k+1] + (-data[k] * a + data[k]);
    }
}
__device__ float linear_kernel_GPU(float in)
{
    float d_y;
    return 1.-abs(in);
}
__global__ void linear_interpolation_kernel_function_GPU(float* __restrict__ result_d, const float* __restrict__ data_d, const float* __restrict__ x_out_d, const int M, const int N)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;
    float reg_x_out = x_out_d[j]+M/2;
    int k = __float2int_rz(reg_x_out);
    float a = reg_x_out - truncf(reg_x_out);
    float dk = data_d[k];
    float dkp1 = data_d[k+1];
    result_d[j] = a * dkp1 + (-dk * a + dk);
}
__global__ void linear_interpolation_kernel_function_GPU_texture(float* __restrict__ result_d, const float* __restrict__ x_out_d, const int M, const int N)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;
    float reg_x_out = x_out_d[j]+M/2;
    int k = __float2int_rz(reg_x_out);
    float a = reg_x_out - truncf(reg_x_out);
    float dk = tex1Dfetch(data_d_texture,k);
    float dkp1 = tex1Dfetch(data_d_texture,k+1);
    result_d[j] = a * dkp1 + (-dk * a + dk);
}
__global__ void linear_interpolation_kernel_function_GPU_texture_filtering(float* __restrict__ result_d, const float* __restrict__ x_out_d, const int M, const int N)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;
    if(j<N) result_d[j] = tex1D(data_d_texture_filtering,float(x_out_d[j]+M/2+0.5));
}
void linear_interpolation_function_GPU(float* result_d, float* data_d, float* x_in_d, float* x_out_d, int M, int N){
    dim3 dimBlock(BLOCK_SIZE,1); dim3 dimGrid(N/BLOCK_SIZE + (N%BLOCK_SIZE == 0 ? 0:1),1);
    linear_interpolation_kernel_function_GPU<<<dimGrid,dimBlock>>>(result_d, data_d, x_out_d, M, N);
}
void linear_interpolation_function_GPU_texture(float* result_d, float* data_d, float* x_in_d, float* x_out_d, int M, int N){
    cudaBindTexture(NULL, data_d_texture, data_d, M*sizeof(float));
    dim3 dimBlock(BLOCK_SIZE,1); dim3 dimGrid(N/BLOCK_SIZE + (N%BLOCK_SIZE == 0 ? 0:1),1);
    linear_interpolation_kernel_function_GPU_texture<<<dimGrid,dimBlock>>>(result_d, x_out_d, M, N);
}
void linear_interpolation_function_GPU_texture_filtering(float* result_d, float* data, float* x_in_d, float* x_out_d, int M, int N){
    cudaArray* data_d = NULL; gpuErrchk(cudaMallocArray(&data_d, &data_d_texture_filtering.channelDesc, M, 1));
    gpuErrchk(cudaMemcpyToArray(data_d, 0, 0, data, sizeof(float)*M, cudaMemcpyHostToDevice));
    gpuErrchk(cudaBindTextureToArray(data_d_texture_filtering, data_d));
    data_d_texture_filtering.normalized = false;
    data_d_texture_filtering.filterMode = cudaFilterModeLinear;
    dim3 dimBlock(BLOCK_SIZE,1); dim3 dimGrid(N/BLOCK_SIZE + (N%BLOCK_SIZE == 0 ? 0:1),1);
    linear_interpolation_kernel_function_GPU_texture_filtering<<<dimGrid,dimBlock>>>(result_d, x_out_d, M, N);
}
/* MAIN */
int main()
int M=1024; // --- Number of input points
int N=1024; // --- Number of output points
int Nit = 100; // --- Number of computations for time measurement
// --- Input sampling
float* x_in; gpuErrchk(cudaMallocManaged(&x_in,sizeof(float)*M));
// --- Input data
float *data; gpuErrchk(cudaMallocManaged(&data,(M+1)*sizeof(float))); Data_Generator(data,M); data[M]=0.;
// --- Output sampling
float* x_out; gpuErrchk(cudaMallocManaged((void**)&x_out,sizeof(float)*N)); randspace(x_out,-M/2.,M/2.,N);
// --- Result allocation
float *result_CPU; result_CPU=(float*)malloc(N*sizeof(float));
float *result_d; gpuErrchk(cudaMallocManaged(&result_d,sizeof(float)*N));
float *result_d_texture; gpuErrchk(cudaMallocManaged(&result_d_texture,sizeof(float)*N));
float *result_d_texture_filtering; gpuErrchk(cudaMallocManaged(&result_d_texture_filtering,sizeof(float)*N));
// --- Reference interpolation result as evaluated on the CPU
linear_interpolation_function_CPU(result_CPU, data, x_in, x_out, M, N);
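float time;
cudaEvent_t start, stop;
gpuErrchk(cudaEventCreate(&start)); // events must be created before they are recorded
gpuErrchk(cudaEventCreate(&stop));
cudaEventRecord(start, 0);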
for (int k=0; k<Nit; k++) linear_interpolation_function_GPU(result_d, data, x_in, x_out, M, N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop); // wait for the stop event before reading the elapsed time
cudaEventElapsedTime(&time, start, stop);
cout << "GPU Global memory [ms]: " << setprecision (10) << time/Nit << endl;
cudaEventRecord(start, 0);
for (int k=0; k<Nit; k++) linear_interpolation_function_GPU_texture_filtering(result_d_texture_filtering, data, x_in, x_out, M, N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop); // wait for the stop event before reading the elapsed time
cudaEventElapsedTime(&time, start, stop);
cout << "GPU Texture filtering [ms]: " << setprecision (10) << time/Nit << endl;
cudaEventRecord(start, 0);
for (int k=0; k<Nit; k++) linear_interpolation_function_GPU_texture(result_d_texture, data, x_in, x_out, M, N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop); // wait for the stop event before reading the elapsed time
cudaEventElapsedTime(&time, start, stop);
cout << "GPU Texture [ms]: " << setprecision (10) << time/Nit << endl;
float diff_norm=0.f, norm=0.f;
for(int j=0; j<N; j++) {
diff_norm = diff_norm + (result_CPU[j]-result_d[j])*(result_CPU[j]-result_d[j]);
norm = norm + result_CPU[j]*result_CPU[j];
printf("Error GPU [percentage] = %f\n",100.*sqrt(diff_norm/norm));
float diff_norm_texture_filtering=0.f;
for(int j=0; j<N; j++) {
diff_norm_texture_filtering = diff_norm_texture_filtering + (result_CPU[j]-result_d_texture_filtering[j])*(result_CPU[j]-result_d_texture_filtering[j]);
printf("Error texture filtering [percentage] = %f\n",100.*sqrt(diff_norm_texture_filtering/norm));
float diff_norm_texture=0.f;
for(int j=0; j<N; j++) {
diff_norm_texture = diff_norm_texture + (result_CPU[j]-result_d_texture[j])*(result_CPU[j]-result_d_texture[j]);
printf("Error texture [percentage] = %f\n",100.*sqrt(diff_norm_texture/norm));
return 0;
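For readers who want to check the per-point arithmetic of the kernels above without a GPU, here is a minimal host-side sketch in plain C++. The name `lerp_at` is hypothetical and is not the post's `linear_interpolation_function_CPU`; it only mirrors the truncate, take the fractional part, then blend steps, and it assumes the index after the sample position is valid (the program above pads `data` with one extra element for that reason).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): linearly interpolates
// `data` at the fractional index `pos` with the same arithmetic as the kernels.
// Assumes pos >= 0 and that data[k + 1] exists for k = trunc(pos).
float lerp_at(const std::vector<float>& data, float pos) {
    assert(pos >= 0.0f && static_cast<std::size_t>(pos) + 1 < data.size());
    const int k = static_cast<int>(pos);     // truncation toward zero, like __float2int_rz
    const float a = pos - std::trunc(pos);   // fractional part in [0, 1)
    const float dk = data[k];
    const float dkp1 = data[k + 1];
    return a * dkp1 + (-dk * a + dk);        // algebraically (1 - a) * dk + a * dkp1
}
```

Calling `lerp_at` on the samples `{0, 10, 20}` at position 0.5 should give the midpoint of the first two samples.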
My system:
Intel Core 2 Duo E4500, 3700g memory, 2M L2 cache, x64, Fedora 17
How I measure flops/mflops
I use the PAPI library, which reads the hardware performance counters, to measure the flops and MFLOPS of my code. It returns the real time, the process time, the total floating-point operation count, and finally the operation count divided by the process time, which is the MFLOPS figure. The library uses hardware counters to count floating-point instructions or floating-point operations, together with total cycles, to produce these results.
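To make the last step concrete, here is a small sketch of how an MFLOPS figure follows from an operation count and a process time. The helper name `mflops` is an assumption for illustration and is not part of PAPI.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not a PAPI function): converts a raw floating-point
// operation count and a process time in seconds into MFLOPS, i.e. millions
// of floating-point operations per second.
double mflops(long long flop_count, double proc_time_seconds) {
    assert(proc_time_seconds > 0.0);
    return static_cast<double>(flop_count) / proc_time_seconds / 1.0e6;
}
```

With the first sample result reported further down in this question (807,107,072 operations in 1.227672 s of process time), this gives about 657.43, in line with the reported MFLOPS value.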
My computational kernels
I used two kernels: a three-loop matrix-matrix multiplication (square matrices), and three nested loops that perform some operations on 1D arrays in the innermost loop.
First kernel: MM
float a[size][size];
float b[size][size];
float c[size][size];
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
for (int k = 0; k < size; k+=1) {
c[i][j]=c[i][j]+a[i][k] * b[k][j];
}
}
}
Second kernel with 1d array
float d[size];
float e[size];
float f[size];
float g[size];
float r = 3.6541;
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
for (int k = 0; k < size; ++k) {
What I know about flops
Matrix-matrix multiplication (MM) performs 2 floating-point operations in its inner loop (one multiply and one add), and since there are 3 nested loops that each iterate size times, in theory the total is 2*n^3 flops for MM.
In the second kernel there are 3 nested loops, and the innermost loop does some computation on the 1D arrays. There are 4 floating-point operations in this loop, hence a theoretical total of 4*n^3 flops.
I know that the flops calculated above are not exactly what happens on a real machine. On a real machine there are other operations, such as loads and stores, which add to our theoretical flops.
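The two theoretical counts can be checked with a few lines of C++. This is only a sketch of the arithmetic; the function name `theoretical_flops` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: theoretical floating-point operation count for three
// nested loops of extent n with `ops_per_iteration` operations in the body.
constexpr std::uint64_t theoretical_flops(std::uint64_t n,
                                          std::uint64_t ops_per_iteration) {
    return ops_per_iteration * n * n * n;
}

// Compile-time checks for n = 512.
static_assert(theoretical_flops(512, 2) == 268435456ULL, "MM kernel: 2*n^3");
static_assert(theoretical_flops(512, 4) == 536870912ULL, "1D kernel: 4*n^3");
```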
Questions:
When I use 1D arrays, as in the second kernel, the theoretical flop count is the same as, or close to, the count I get by executing the code and measuring it. In fact, with 1D arrays the measured flops equal the number of operations in the innermost loop multiplied by n^3. But when I use my first kernel, MM, which uses 2D arrays, the theoretical count is 2n^3 while the measured value is much higher: about (4 + 2 operations in the innermost loop of the matrix multiplication) * n^3 = 6n^3.
I replaced the matrix multiplication line in the innermost loop with just the code below:
The theoretical flop count for this code in 3 nested loops is 1 operation * n^3 = n^3. Again, when I ran the code the result was much higher than expected: (2 + 1 operation in the innermost loop) * n^3 = 3*n^3.
Sample results for a matrix of size 512x512:
Real_time: 1.718368 Proc_time: 1.227672 Total flpops: 807,107,072 MFLOPS: 657.429016
Real_time: 3.608078 Proc_time: 3.042272 Total flpops: 807,024,448 MFLOPS: 265.270355
Theoretical flops: 2*512*512*512 = 268,435,456
Measured flops: 807,107,072, which is roughly 6*512^3 (= 805,306,368)
Sample result for the 1D array operation in 3 nested loops:
Real_time: 1.282257 Proc_time: 1.155990 Total flpops: 536,872,000 MFLOPS: 464.426117
Theoretical flops: 4n^3 = 4*512^3 = 536,870,912
Measured flops: 4*512^3 + overhead (other operations?) = 536,872,000
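One way to compare the measurements with the theory is to divide each measured count by n^3 and look at the resulting operations per innermost iteration. The sketch below does that; `flops_per_iteration` is a hypothetical name, and the inputs are the counts reported above.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: measured floating-point operations per innermost-loop
// iteration for three nested loops of extent n.
double flops_per_iteration(long long measured_flops, long long n) {
    assert(n > 0);
    const double iterations = static_cast<double>(n) * n * n;
    return static_cast<double>(measured_flops) / iterations;
}
```

For n = 512 the MM measurement (807,107,072) works out to roughly 6.01 operations per iteration against a theoretical 2, while the 1D measurement (536,872,000) works out to almost exactly 4.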
I could not find any reason for this behaviour.
Is my assumption correct?
I hope this description is simpler than the previous one.
By practical I meant the real flop count measured by executing the code.
void countFlops() {
int size = 512;
int itr = 20;
float a[size][size];
float b[size][size];
float c[size][size];
/* float d[size];
float e[size];
float f[size];
float g[size];*/
float r = 3.6541;
float real_time, proc_time, mflops;
long long flpops;
float ireal_time, iproc_time, imflops;
long long iflpops;
int retval;
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
a[i][j] = b[i][j] = c[i][j] = 1.0125; // index with [i][j] so every element is initialised, not only the diagonal
/* for (int i = 0; i < size; ++i) {
if ((retval = PAPI_flops(&ireal_time, &iproc_time, &iflpops, &imflops))
< PAPI_OK) {
printf("Could not initialise PAPI_flops \n");
printf("Your platform may not support floating point operation event.\n");
printf("retval: %d\n", retval);
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
for (int k = 0; k < size; k+=16) {
c[i][j]=c[i][j]+a[i][k] * b[k][j];
/* for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
for (int k = 0; k < size; ++k) {
if ((retval = PAPI_flops(&real_time, &proc_time, &flpops, &mflops))
< PAPI_OK) {
printf("retval: %d\n", retval);
string flpops_tmp;
flpops_tmp = output_formatted_string(flpops);
printf(
"calculation: Real_time: %f Proc_time: %f Total flpops: %s MFLOPS: %f\n",
real_time, proc_time, flpops_tmp.c_str(), mflops);
thank you
If you need to count the number of operations, you can write a simple class that acts like a floating-point value and gathers statistics. It will be interchangeable with the built-in types.
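As a Boost-free illustration of that idea, separate from the answer's own code that follows, here is a minimal sketch. `CountedDouble` is a hypothetical name; it only covers the four arithmetic operators and keeps a single global counter, which is a simplification.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical minimal wrapper (not the answer's Profiled class): behaves
// like a double for +, -, *, / and counts each arithmetic operation.
struct CountedDouble {
    double value;

    CountedDouble(double v = 0.0) : value(v) {}

    // Single global counter, exposed through a function-local static.
    static std::size_t& ops() {
        static std::size_t counter = 0;
        return counter;
    }

    friend CountedDouble operator+(CountedDouble a, CountedDouble b) {
        ++ops();
        return CountedDouble(a.value + b.value);
    }
    friend CountedDouble operator-(CountedDouble a, CountedDouble b) {
        ++ops();
        return CountedDouble(a.value - b.value);
    }
    friend CountedDouble operator*(CountedDouble a, CountedDouble b) {
        ++ops();
        return CountedDouble(a.value * b.value);
    }
    friend CountedDouble operator/(CountedDouble a, CountedDouble b) {
        ++ops();
        return CountedDouble(a.value / b.value);
    }
};
```

Evaluating `x + y * x + y` with this type performs one multiplication and two additions, so the counter advances by three.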
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/operators.hpp>
#include <iostream>
#include <ostream>
#include <utility>
#include <cstddef>
#include <vector>
using namespace boost;
using namespace std;
class Statistic {
size_t ops = 0;
public:
Statistic &increment() { ++ops; return *this; } // count one operation
size_t count() const { return ops; }
};
template<typename Domain>
class Profiled: field_operators<Profiled<Domain>>
Domain value;
static vector<Statistic> stat;
void stat_increment()
struct StatisticScope
Statistic &current()
return stat.back();
template<typename ...Args>
Profiled(Args&& ...args)
: value{forward<Args>(args)...}
Profiled& operator+=(const Profiled& x)
return *this;
Profiled& operator-=(const Profiled& x)
return *this;
Profiled& operator*=(const Profiled& x)
return *this;
Profiled& operator/=(const Profiled& x)
return *this;
template<typename Domain>
vector<Statistic> Profiled<Domain>::stat{1};
int main()
typedef Profiled<double> Float;
Float::StatisticScope s;
Float x = 1.0, y = 2.0, res = 0.0;
res = x+y*x+y;
cout << s.current().count() << endl;
using namespace numeric::ublas;
Float::StatisticScope s;
matrix<Float> x{10, 20},y{20,5},res{10,5};
res = prod(x,y);
cout << s.current().count() << endl;
Output is:
P.S. Your matrix loop is not cache-friendly and, as a result, very inefficient.
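One common way to make such a loop friendlier to the cache is to reorder it as i, k, j, so that the innermost loop walks along rows of both b and c. The sketch below assumes row-major matrices stored in flat vectors; the name `matmul_ikj` is hypothetical, and the actual speed-up depends on the machine and compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: c += a * b for n x n row-major matrices stored in flat
// vectors. The loops run in i, k, j order, so the innermost loop reads b and
// updates c along contiguous rows instead of striding down a column of b.
void matmul_ikj(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const float aik = a[i * n + k];  // reused across the whole j loop
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}
```

The operation count is unchanged (one multiply and one add per innermost iteration); only the memory access pattern differs.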
int size = 512;
float a[size][size];
This is not legal C++ code. Standard C++ does not support variable-length arrays (VLAs).
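A portable alternative, sketched here under the assumption that a flat row-major layout is acceptable, is to keep the elements in a `std::vector` sized at run time. The `Matrix` type below is hypothetical and not from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: an n x n float matrix whose size is chosen at run time.
// The elements live on the heap in one contiguous row-major block.
struct Matrix {
    std::size_t n;
    std::vector<float> data;

    Matrix(std::size_t size, float init) : n(size), data(size * size, init) {}

    // Element access with a debug-time bounds check.
    float& at(std::size_t i, std::size_t j) {
        assert(i < n && j < n);
        return data[i * n + j];
    }
};
```

Heap storage also avoids placing three 512x512 float arrays (about 1 MiB each) on the stack, which may strain the stack limit on some platforms.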