3D -> 1D array indexing - C++

In C++, what is the indexing value for a W * H * D sized 3D array?
For a particular (i, j, k), is this the correct indexing:
i*W*H + j*W + k

What you have written is equivalent to the pointer arithmetic that this would do:
T x[D][H][W];
x[i][j][k]; // Pointer arithmetic done here
Obviously, depending on how you order D, H and W (or i, j, k), the calculation will differ.
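For illustration, here is a minimal sketch (dimensions made up) that checks the flat-index formula against the compiler's own pointer arithmetic for T x[D][H][W]:
#include <cassert>
#include <cstddef>

int main() {
    constexpr std::size_t D = 3, H = 4, W = 5;
    static int x[D][H][W];
    int* flat = &x[0][0][0];
    for (std::size_t i = 0; i < D; ++i)
        for (std::size_t j = 0; j < H; ++j)
            for (std::size_t k = 0; k < W; ++k)
                // i*W*H + j*W + k reproduces the address of x[i][j][k]
                assert(&x[i][j][k] == flat + (i * W * H + j * W + k));
}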

There is no one "correct" order, but the version you've given should work. The order in which you apply the indices will determine whether you do row-major or column-major indexing. If you're porting Fortran code (for example) it can make sense to reverse the "normal" C order.

Width, height and depth are meaningless in this context. What you need to know is that multidimensional arrays are stored in row-major order.

Yes, assuming i varies from 0 ... D-1, j varies from 0 ... H-1, and k varies from 0 ... W-1.
Usually, though, I thought the purpose of having an indexer was to express relations within a sparse matrix, so you don't need to deal with (and expend memory for) the whole thing. If your data span the whole matrix, you might look into creating the 3D matrix as a pointer to an array of pointers, each of which points to an array of pointers. This lets you use the x[i][j][k] notation and may be faster; see the sketch after the link below.
See http://www.nr.com/cpppages/chapappsel.pdf for a description.
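A minimal sketch of that layout, in the Numerical Recipes style (the name alloc3d is made up): allocate one contiguous block plus two levels of pointer tables into it, so x[i][j][k] works without computing the flat index by hand.
#include <cstddef>

float*** alloc3d(std::size_t D, std::size_t H, std::size_t W) {
    float* block = new float[D * H * W];   // one contiguous slab
    float*** x = new float**[D];           // top-level pointer table
    for (std::size_t i = 0; i < D; ++i) {
        x[i] = new float*[H];              // per-slice row-pointer table
        for (std::size_t j = 0; j < H; ++j)
            x[i][j] = block + (i * H + j) * W;
    }
    return x;                              // x[i][j][k] now works
}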

If you need to iterate over all elements, it is best to do it in
for i
    for j
        for k
order. This way it is fastest, because the array index is incremented by one each time and values can be prefetched. There is no single correct way to do this, but you have probably chosen the best one.
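A hedged sketch of that traversal over a flat row-major D x H x W array (all names here are illustrative):
#include <cstddef>

void scan(const float* a, std::size_t D, std::size_t H, std::size_t W) {
    for (std::size_t i = 0; i < D; ++i)
        for (std::size_t j = 0; j < H; ++j)
            for (std::size_t k = 0; k < W; ++k) {
                // the flat index advances by exactly 1 per inner iteration,
                // so memory is read sequentially (prefetch friendly)
                float v = a[(i * H + j) * W + k];
                (void)v; // use v here
            }
}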

Related

How does the dimension argument of `sgemm` work?

I'm trying to understand the documentation for sgemm as I am transitioning code from using this library to a different library.
The function prototype is
sgemm ( character TRANSA,
        character TRANSB,
        integer M,
        integer N,
        integer K,
        real ALPHA,
        real, dimension(lda,*) A,
        integer LDA,
        real, dimension(ldb,*) B,
        integer LDB,
        real BETA,
        real, dimension(ldc,*) C,
        integer LDC
      )
I am having trouble understanding the role of LDA and LDB. The documentation says
LDA is INTEGER
On entry, LDA specifies the first dimension of A as declared
in the calling (sub) program. When TRANSA = 'N' or 'n' then
LDA must be at least max( 1, m ), otherwise LDA must be at
least max( 1, k ).
What does it mean that it specifies the first dimension of A? Is this like switching between row and column major? Or is this slicing the tensor?
LD stands for leading dimension. BLAS is originally a library of Fortran 77 subroutines and in Fortran matrices are stored column-wise: A(i,j) is immediately followed in memory by A(i+1,j), which is opposite of C/C++ where a[i][j] is followed by a[i][j+1]. In order to access element A(i,j) of a matrix that has dimensions A(LDA,*) (which reads as LDA rows and an unspecified number of columns), you need to look (j-1)*LDA + (i-1) elements from the beginning of the matrix (Fortran arrays are 1-indexed by default), therefore you need to know the value of LDA. You don't need to know the actual number of columns, therefore the * in the dummy argument.
It is the same in C/C++. If you have a 2D array declared as a[something][LDA], then element a[i][j] is located i*LDA + j positions after the start of the array, and you only need to know LDA - the value of something does not affect the calculation of the address of a[i][j].
Although GEMM operates on an M x K matrix A, the actual data may be embedded in a bigger matrix that is LDA x L, where LDA >= M and L >= K, therefore the LDA is specified explicitly. The same applies to LDB and LDC.
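To make the addressing concrete, here is a small C++ sketch (0-based indices, unlike Fortran's 1-based default; the helper at() is made up) of an M x K sub-matrix embedded in column-major storage with leading dimension LDA:
#include <cstdio>

// element (i, j) of column-major storage with leading dimension lda
inline double& at(double* a, int lda, int i, int j) {
    return a[j * lda + i];
}

int main() {
    const int LDA = 4;          // rows of the storage, not of A
    double big[LDA * 4] = {};   // a 4 x 4 column-major buffer
    // treat the top-left 2 x 3 block as the M x K matrix A;
    // GEMM would receive M = 2, K = 3, LDA = 4
    for (int j = 0; j < 3; ++j)
        for (int i = 0; i < 2; ++i)
            at(big, LDA, i, j) = 10.0 * i + j;
    std::printf("%g\n", at(big, LDA, 1, 2)); // prints 12
}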
BLAS was developed many years ago when computer programming was quite different than what it is today. Memory management, in particular, was not as flexible as it is nowadays. Allocating one big matrix and then using and reusing portions of it to store smaller matrices was the norm. Also, GEMM is extensively used in, e.g., iterative algorithms that work on various sub-matrices and it is faster to keep the data in the original matrix and just specify the sub-matrix location and dimension, so you need to provide both dimensions.
Starting with Fortran 90, the language has array slicing and automatic array descriptors that allow one to discover both the dimensions of a slice and those of the bigger matrix, so if GEMM was written in Fortran 90 or later, it wouldn't be that verbose in respect to its arguments. But even if that was the case, C doesn't have array descriptors, so you'll still have to provide all those arguments in order to make GEMM callable from C. In C++, one can hide the descriptor inside a matrix class, and many math libraries actually do so (for example, Scythe).

Reshaping Fortran arrays

I have a huge m by 1 array (m is very large) called X, which is the result of a Fortran matmul operation. My problem is to store this apparently 2D array in a 1D array Y of size m.
I tried Y = reshape(X, [[2]]) and this results in some NaN elements. Can anyone point me to the Fortran commands to do this quickly? The elements of X may be zero or non-zero.
The second argument of reshape (or the one with keyword shape=) is the shape of the function's result. In your call, you have requested shape [2].
An array with shape [2] is a rank-1 array with two elements. You want a rank-1 array with m elements:
Y = RESHAPE(X, [m])
Now, in this case there's no need to use reshape:
Y = X(:,1)
where the right-hand side is the rank-1 array section of X.
When you have Y=reshape(X,[2]), if Y is not allocatable and not of size 2, then you have a problem which may indeed result in your compiler deciding (as it is quite entitled to do) to give you a few NaNs.
Note also that you may not need to reshape your array, depending on how you intend to later use it.

Performance of large 3D arrays: contiguous 1D storage vs T***

I wonder if anyone could advise on storage of large (say 2000 x 2000 x 2000) 3D arrays for finite-difference discretization computations. Does contiguous storage (float*) give better performance than float*** on modern CPU architectures?
Here is a simplified example of computations, which are done over entire arrays:
for i ...
for j ...
for k ...
u[i][j][k] += v[i][j][k+1] + v[i][j][k-1]
+ v[i][j+1][k] + v[i][j-1][k] + v[i+1][j][k] + v[i-1][j][k];
Vs
u[i * iStride + j * jStride + k] += ...
PS:
Considering the size of the problems, storing T*** is a very small overhead. Access is not random. Moreover, I do loop blocking to minimize cache misses. I am just wondering how the triple dereference in the T*** case compares to the index computation and single dereference in the 1D-array case.
These are not apples-to-apples comparisons: a flat array is just that, a flat array, which your code partitions into segments according to some logic of linearizing a rectangular 3D array. You access an element of the array with a single dereference, plus a handful of math operations.
float***, on the other hand, lets you keep a "jagged" array of arrays of arrays, so the structure that you can represent inside such an array is a lot more flexible. Naturally, you need to pay for that flexibility with the additional CPU cycles required for dereferencing: a pointer to pointer to pointer, then a pointer to pointer, and finally a pointer (the three pairs of square brackets in the code).
Naturally, access to the individual elements of float*** is going to be a little slower, if you access them in truly random order. However, if the order is not random, the difference that you see may be tiny, because the values of pointers would be cached.
float*** will also require more memory, because you need to allocate two additional levels of pointers.
The short answer is: benchmark it. If the results are inconclusive, it means it doesn't matter. Do what makes your code most readable.
As @dasblinkenlight has pointed out, the structures are not equivalent because T*** can be jagged.
At the most fundamental level, though, this comes down to the arithmetic and memory access operations.
For your 1D array, as you have already (almost) written, the calculation is:
ptr = u + (i * iStride) + (j * jStride) + k
read *ptr
With T***:
ptr = u + i
x = read *ptr
ptr = x + j
y = read *ptr
ptr = y + k
read *ptr
So you are trading two multiplications for two memory accesses.
In computer go, where people are very performance-sensitive, everyone (AFAIK) uses T[361] rather than T[19][19] (*). This decision is based on benchmarking, both in isolation and the whole program. (It is possible everyone did those benchmarks years and years ago, and have never done them again on the latest hardware, but my hunch is that a single 1-D array will still be better.)
However your array is huge, in comparison. As the code involved in each case is easy to write, I would definitely try it both ways and benchmark.
*: Aside: I think it is actually T[21][21] vs. T[441] in most programs, as an extra row all around is added to speed up board-edge detection.
One issue that has not been mentioned yet is aliasing.
Does your compiler support some type of keyword like restrict to indicate that you have no aliasing? (It's not part of C++11, so it would have to be an extension.) If so, performance may be very close to the same. If not, there could be significant differences in some cases. The issue will be with something like:
for (int i = ...) {
    for (int j = ...) {
        a[j] = b[i];
    }
}
Can b[i] be loaded once per outer loop iteration and stored in a register for the entire inner loop? In the general case, only if the arrays don't overlap. How does the compiler know? It needs some type of restrict keyword.
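A hedged sketch using the common non-standard __restrict__ extension (GCC and Clang; MSVC spells it __restrict): the promise that a and b never alias lets the compiler hoist b[i] into a register for the whole inner loop.
void copyRows(float* __restrict__ a, const float* __restrict__ b,
              int n, int m) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            a[j] = b[i]; // b[i] can now be loaded once per i
}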

OpenCV Mat array access, which way is the fastest for and why?

I am wondering about the ways of accessing data in a Mat in OpenCV. As you know, we can access the data in many ways. I want to store an image (Width x Height x 1-depth) in a Mat and loop over each pixel in the image. Is using ptr<>(irow) to get the row pointer and then accessing each column in the row the best way? Or is using at<>(irow, jcol) the best? Or is directly calculating the index with index = irow*Width + jcol the best? Does anyone know the reason?
Thanks in advance
You can find information here in the documentation: the basic image container and how to scan images.
I advise you to practice with at (here) if you are not experienced with OpenCV or with the C language's type hell. But the fastest way is ptr, as Nolwenn answered, because you avoid the type checking.
at<T> does a range check at every call, thus making it slower than ptr<T>, but safer.
So, if you're confident that your range calculations are correct and you want the best possible speed, use ptr<T>.
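For illustration, a small sketch contrasting the two styles on a single-channel 8-bit cv::Mat (the function names are made up):
#include <opencv2/core.hpp>

double sumAt(const cv::Mat& img) {            // checked access
    double s = 0;
    for (int y = 0; y < img.rows; ++y)
        for (int x = 0; x < img.cols; ++x)
            s += img.at<unsigned char>(y, x); // range check in debug builds
    return s;
}

double sumPtr(const cv::Mat& img) {           // row-pointer access
    double s = 0;
    for (int y = 0; y < img.rows; ++y) {
        const unsigned char* row = img.ptr<unsigned char>(y);
        for (int x = 0; x < img.cols; ++x)
            s += row[x];                      // plain pointer in the inner loop
    }
    return s;
}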
I realize this is an old question, but I think the current answers are somehow misleading.
Calling both at<T>(...) and ptr<T>(...) will check the boundaries in the debug mode. If the _DEBUG macro is not defined, they will basically calculate y * width + x and give you either the pointer to the data or the data itself. So using at<T>(...) in release mode is equivalent to calculating the pointer yourself, but safer because calculating the pointer is not just y * width + x if the matrix is just a sub-view of another matrix. In debug mode, you get the safety checks.
I think the best way is to process the image row by row: get the row pointer using ptr<T>(y) and then use p[x]. This has the benefit that you don't have to worry about the various data layouts (for example, a sub-view with padded rows) and you still have a plain pointer in the inner loop.
You can use plain pointers all the way, which would be most efficient because you avoid the one multiplication per row, but then you need to use step1(i) to advance the pointer. I think that using ptr<T>(y) is a nice trade-off.
According to the official documentation, the most efficient way is to get the pointer to the row first and then just use the plain C operator []. This also saves a multiplication for each iteration.
// compute the sum of positive matrix elements
// (assuming that M is a double-precision matrix)
double sum = 0;
for (int i = 0; i < M.rows; i++)
{
    const double* Mi = M.ptr<double>(i);
    for (int j = 0; j < M.cols; j++)
        sum += std::max(Mi[j], 0.);
}

How to select a column from a row-major array in sub-linear time?

Let's say that I'm given a row-major array.
int* a = (int *)malloc(9 * 9 * sizeof(int));
Look at this as a 2D 9x9 array where a (row,column) index corresponds to [row * 9 + column]
Is there a way I can select a single column from this array in sub-linear time?
Since the columns won't be contiguous, we can't do a direct memcpy like we do to get a single row.
The linear-time solution is obvious, I guess, but I'm hoping for some sub-linear solution.
Thanks.
It is not clear what you mean by sublinear. If you consider the 2D array to be of size NxN, then sublinear in N is impossible: to copy N elements you need to perform N copy operations, so the copy is linear in the number of elements being copied.
The comment about memcpy seems to indicate that you mistakenly believe that memcpy is sublinear in the number of elements being copied. It is not. The advantage of memcpy is that the constant hidden in the big-O notation is small, but the operation is linear in the size of the memory being copied.
The next question is whether the big-O analysis actually makes sense. If your array is 9x9, the constants hidden by the big-O notation can matter more than the asymptotic complexity.
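To make the point concrete, a sketch of both copies for a row-major N x N array (the function names are made up; both are O(N), the column copy just strides):
#include <cstring>

void copyRow(const int* a, int N, int row, int* out) {
    std::memcpy(out, a + row * N, N * sizeof(int)); // contiguous, still O(N)
}

void copyColumn(const int* a, int N, int col, int* out) {
    for (int i = 0; i < N; ++i)
        out[i] = a[i * N + col]; // stride of N elements, also O(N)
}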
I don't really get what you mean, but consider:
const size_t x_sz = 9;
size_t x = 3, y = 6; // or whichever element you wish to access
int value = a[y*x_sz + x];
This will be a constant-time, O(1) expression: it must calculate the offset and load the value.
To iterate through every value in a column:
const size_t x_sz = 9, y_sz = 9;
size_t x = 3; // or whichever column you wish to access
for (size_t y = 0; y != y_sz; ++y) {
    int value = a[y*x_sz + x];
    // value is the current column value
}
Again, each iteration is constant time, so the whole iteration sequence is O(n) (linear); note that it would still be linear if the column were contiguous.