Using arrays in C++ CT reconstruction algorithm - c++

I'm developing a CT reconstruction algorithm using C++. I'm using C++ because I need to use a library written in C++ that will let me read/write a specific file format.
This reconstruction algorithm involves working with 3D and 2D images. I've written similar algorithms in C and MATLAB using arrays. However, I've read that, in C++, arrays are "evil" (see http://www.parashift.com/c++-faq-lite/containers.html). The way I use arrays to manipulate images (in C) is the following (this creates a 3D array that will be used as a 3D image):
int i, j;
int ***image; /* another way to make a 5x12x27 array */
image = (int ***) malloc(depth * sizeof(int **));
for (i = 0; i < depth; ++i) {
    image[i] = (int **) malloc(height * sizeof(int *));
    for (j = 0; j < height; ++j) {
        image[i][j] = (int *) malloc(width * sizeof(int));
    }
}
or I use 1-dimensional arrays and do index arithmetic to simulate 3D data. At the end, I free the necessary memory.
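For reference, the flat (1-dimensional) variant mentioned here is just the following sketch, reusing width, height and depth from above (the cast keeps it valid as C++ too):
int *image = (int *) malloc((size_t)depth * height * width * sizeof(int));
/* voxel (x, y, z), row-major: */
image[(size_t)z * height * width + (size_t)y * width + x] = 42;
free(image);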
I have read that there are equivalent ways of doing this in C++. I've seen that I could create my own matrix class that uses vectors of vectors (from the STL), or that I could use the Boost matrix library. The problem is that these options make my code look bloated.
My questions are:
1) Is there a reason to not use arrays for this purpose? Why should I use the more complicated data structures?
2) I don't think I'll use the advantages of containers (as seen in the C++ FAQ lite link I posted). Is there something I'm not seeing?
3) The C++ FAQ lite mentions that arrays will make me less productive. I don't really see how that applies to my case. What do you guys think?

Working at a higher level generally saves you time, given equal familiarity with both styles of code. It's usually simpler, and you may not need to bother with chores like deallocation.
That said, if you already have the C code and are basically converting malloc to new (or leaving it as-is), then it makes perfect sense to keep it. There's no reason to duplicate work for no advantage. If you're going to extend it and add more features, you might want to think about a rewrite. Image manipulation is often an intensive process, and I see straight code like yours all the time for performance reasons.
Arrays have a purpose, vectors have a purpose, and so on. You seem to understand the tradeoffs, so I won't go into them. Understanding the context of what you're doing is essential; anyone who says that arrays are always bad, or that vectors always carry too much overhead, probably doesn't know what they're talking about.

I know the alternatives look complicated at first while your code seems simple, but eventually the hand-rolled version is going to hurt.
Use a library like Boost, or consider a dedicated 3D image toolkit such as VTK.
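For example, a minimal sketch with Boost.MultiArray (assuming Boost is available; the 5x12x27 dimensions are taken from the question):
#include <boost/multi_array.hpp>

boost::multi_array<int, 3> image(boost::extents[5][12][27]); // depth x height x width
image[1][2][3] = 10;  // storage is contiguous and freed automatically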

If the 3D canvas has a fixed size, you won't win much by using containers. I would, however, avoid allocating the matrix in small chunks as you do, and instead just do
#define DIM_X 5
#define DIM_Y 12
#define DIM_Z 27
#define SIZE (DIM_X * DIM_Y * DIM_Z)
#define OFFS(x, y, z) ((x) + (y) * DIM_X + (z) * (DIM_Y * DIM_X))
and then
class Image3D {
private:
    unsigned int pixel_data[SIZE];
public:
    unsigned int & operator()(int x, int y, int z) { return pixel_data[OFFS(x, y, z)]; }
};
after which you can do e.g.
Image3D img;
img(1,1,1) = 10;
img(2,2,2) = img(1,1,1) + 2;
without any memory allocation or algorithm overhead. But as others have noted, the choice of data structure also depends on what kind of algorithms you plan to run on the images. You can always, however, adapt a third-party algorithm (e.g. for matrix inversion) with a proper facade class if needed; and this flat representation is much faster than the nested arrays of pointers you wrote.
If the dimensions are not fixed at compile time, you can obviously still use exactly the same approach; it's just that you need to allocate pixel_data dynamically and store the dimensions in the Image3D object itself. Here's that version:
class Image3D {
private:
    unsigned int *pixel_data;
    unsigned int dim_x, dim_y, dim_z;
public:
    Image3D(int xd, int yd, int zd) : dim_x(xd), dim_y(yd), dim_z(zd) {
        pixel_data = new unsigned int[dim_x * dim_y * dim_z];
    }
    virtual ~Image3D() { delete[] pixel_data; }
    unsigned int & operator()(int x, int y, int z) {
        return pixel_data[x + y * dim_x + z * dim_y * dim_x];
    }
};
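Usage of the dynamic version is the same as before; a quick sketch:
Image3D img(5, 12, 27);
img(1, 1, 1) = 10;
img(2, 2, 2) = img(1, 1, 1) + 2;
// pixel_data is released by the destructor when img goes out of scope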

My questions are:
1) Is there a reason to not use arrays for this purpose? Why should I use the more complicated data structures?
I personally prefer to use basic arrays. By basic I mean a 1D linear array. Say you have a 512 x 512 image with 5 slices; then the image array looks like the following:
int sizeX = 512;
int sizeY = 512;
int sizeZ = 5;
float* img = new float[sizeX * sizeY * sizeZ];
To access a pixel/voxel at location (x,y,z), you would do:
float val = img[z*sizeX*sizeY + y*sizeX + x];
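If the repeated index arithmetic bothers you, a tiny helper (a hypothetical sketch; the name is illustrative) keeps call sites readable without changing the flat storage:
#include <cstddef>

inline std::size_t voxel_index(int x, int y, int z, int sizeX, int sizeY) {
    // same row-major layout as above: z slices, then y rows, then x columns
    return static_cast<std::size_t>(z) * sizeX * sizeY
         + static_cast<std::size_t>(y) * sizeX
         + static_cast<std::size_t>(x);
}

// float val = img[voxel_index(x, y, z, sizeX, sizeY)];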
2) I don't think I'll use the advantages of containers (as seen in the C++ FAQ lite link I posted). Is there something I'm not seeing?
Using containers is more of a programming convenience (easier, safer, exception handling, ...). If you are an algorithms person, it might not be your concern at all. However, as one example of using <vector> in C++, you can always do this:
int sizeX = 512;
int sizeY = 512;
int sizeZ = 5;
std::vector<float> img(sizeX * sizeY * sizeZ);
float* p = &img[0];
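As a side note, since C++11 the same pointer can be obtained more directly (for a non-empty vector) with the data() member function:
float* p = img.data();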
3) The C++ FAQ lite mentions that arrays will make me less productive. I don't really see how that applies to my case. What do you guys think?
I don't see why arrays would make you less productive. Of course, C++ folks prefer vectors to raw arrays, but again, that is largely a programming-style matter.
Hope this helps.
Supplement:
The easiest way to do a 2D/3D CT reconstruction would be to use MATLAB/Python + C/C++; but again, this requires enough experience to know when to use which. MATLAB has a built-in FFT/IFFT, so you don't have to write C/C++ code for that. I remember using KissFFT before, and it was no problem.


data locality for implementing 2d array in c/c++

A long time ago, inspired by "Numerical Recipes in C", I started to use the following construct for storing matrices (2D arrays).
double **allocate_matrix(int NumRows, int NumCol)
{
    double **x;
    int i;
    x = (double **)malloc(NumRows * sizeof(double *));
    for (i = 0; i < NumRows; ++i) x[i] = (double *)calloc(NumCol, sizeof(double));
    return x;
}
double **x = allocate_matrix(1000,2000);
x[m][n] = ...;
But recently noticed that many people implement matrices as follows
double *x = (double *)malloc(NumRows * NumCols * sizeof(double));
x[NumCol * m + n] = ...;
From the locality point of view the second method seems perfect, but it has awful readability... So I started to wonder: is my first method, which stores an auxiliary array of double* pointers, really that bad, or will the compiler eventually optimize it so that it is more or less equivalent in performance to the second method? I am suspicious because I think that in the first method two jumps are made when accessing a value, x[m] and then x[m][n], and there is a chance that each time the CPU will first load the x array and then the x[m] array.
p.s. Do not worry about the extra memory for storing the double* pointers; for large matrices it is just a small percentage.
P.P.S. Since many people did not understand my question very well, I will try to re-shape it: do I understand correctly that the first method is a kind of locality hell, where each time x[m][n] is accessed, first the x array is loaded into the CPU cache and then the x[m] array, making each access run at the speed of talking to RAM? Or am I wrong, and the first method is also OK from the data-locality point of view?
For C-style allocations you can actually have the best of both worlds:
double **allocate_matrix(int NumRows, int NumCol)
{
    double **x;
    int i;
    x = (double **)malloc(NumRows * sizeof(double *));
    x[0] = (double *)calloc(NumRows * NumCol, sizeof(double)); /* single contiguous allocation for the entire array */
    for (i = 1; i < NumRows; ++i) x[i] = x[i - 1] + NumCol;
    return x;
}
This way you get data locality and its associated cache/memory access benefits, and you can treat the array as a double ** or a flattened 2D array (array[i * NumCols + j]) interchangeably. You also have fewer calloc/free calls (2 versus NumRows + 1).
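For completeness, freeing under this scheme is just two calls, matching the allocator sketched above:
free(x[0]); /* releases the single contiguous block of NumRows * NumCol doubles */
free(x);    /* releases the array of row pointers */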
No need to guess whether the compiler will optimize the first method. Just use the second method which you know is fast, and use a wrapper class that implements for example these methods:
double& operator()(int x, int y);
double const& operator()(int x, int y) const;
... and access your objects like this:
arr(2, 3) = 5;
Alternatively, if you can bear a little more code complexity in the wrapper class(es), you can implement a class that can be accessed with the more traditional arr[2][3] = 5; syntax. This is implemented in a dimension-agnostic way in the Boost.MultiArray library, but you can do your own simple implementation too, using a proxy class.
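To give a feel for the proxy-class approach, here is a minimal sketch (hypothetical names, not Boost.MultiArray itself):
#include <cstddef>
#include <vector>

class Matrix2D {
    std::vector<double> data_;
    std::size_t cols_;
public:
    Matrix2D(std::size_t rows, std::size_t cols) : data_(rows * cols), cols_(cols) {}

    // Proxy returned by operator[]; its own operator[] indexes within the row.
    class RowProxy {
        double* row_;
    public:
        explicit RowProxy(double* row) : row_(row) {}
        double& operator[](std::size_t j) { return row_[j]; }
    };

    RowProxy operator[](std::size_t i) { return RowProxy(&data_[i * cols_]); }
};

// usage:
// Matrix2D arr(1000, 2000);
// arr[2][3] = 5;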
Note: Considering your usage of C style (a hardcoded non-generic "double" type, plain pointers, function-beginning variable declarations, and malloc), you will probably need to get more into C++ constructs before you can implement either of the options I mentioned.
The two methods are quite different.
While the first method gives you convenient x[m][n] access to the values, it adds another level of indirection (the double** array, hence you need 1+N mallocs), ...
the second method guarantees that ALL values are stored contiguously and only requires one malloc.
I would argue that the second method is always superior. Malloc is an expensive operation and contiguous memory is a huge plus, depending on the application.
In C++, you'd just implement it like this:
std::vector<double> matrix(NumRows * NumCols);
matrix[y * numCols + x] = value; // Access
and if you're concerned with the inconvenience of having to compute the index yourself, add a wrapper that implements operator()(int x, int y) to it.
You are also right that the first method is more expensive when accessing the values, because you need two memory lookups, as you described: x[m] and then x[m][n]. The compiler will not "optimize this away". The first array, depending on its size, will be cached, so the performance hit may not be that bad. In the second case, you need an extra multiplication for direct access.
In the first method you use, the double* entries in the master array point to logical rows (arrays of size NumCol).
So, if you write something like below, you get the benefits of data locality in some sense (pseudocode):
foreach(row in rows):
foreach(elem in row):
//Do something
If you tried the same thing with the second method, with element access done the way you specified (i.e. x[NumCol*m + n]), you would still get the same benefit, because you are treating the array as being in row-major order. If you tried the same pseudocode while accessing the elements in column-major order, I assume you'd get cache misses, given that the array is large enough.
In addition to this, the second method has the additional desirable property of being a single contiguous block of memory which further improves the performance even when you loop through multiple rows (unlike the first method).
So, in conclusion, the second method should be much better in terms of performance.
If NumCol is a compile-time constant, or if you are using GCC with language extensions enabled, then you can do:
double (*x)[NumCol] = (double (*)[NumCol]) malloc(NumRows * sizeof (double[NumCol]));
and then use x as a 2D array and the compiler will do the indexing arithmetic for you. The caveat is that unless NumCol is a compile-time constant, ISO C++ won't let you do this, and if you use GCC language extensions you won't be able to port your code to another compiler.
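Usage is then plain 2D indexing; a small sketch assuming the declaration above:
for (int m = 0; m < NumRows; ++m)
    for (int n = 0; n < NumCol; ++n)
        x[m][n] = 0.0;
free(x); /* one allocation, one free */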

POD real and complex vectors and arrays

[Christian Hacki and Barry point out that a question has been asked specifically about complex previously. My question is more general, as it applies to std::vector, std::array, and all the container classes that use allocators. Also, the answers on the other question are not adequate, IMO. Perhaps I could bump the other question somehow.]
I have a C++ application that uses lots of arrays and vectors of real values (doubles) and complex values. I do not want them initialized to zeros. Hey compiler and STL! - just allocate the dang memory and be done with it. It's on me to put the right values in there. Should I fail to do so, I want the program to crash during testing.
I managed to prevent std::vector from initializing with zeros by defining a custom allocator for use with POD's. (Is there a better way?)
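For reference, the custom-allocator trick referred to here typically looks something like the following sketch (illustrative names; C++11):
#include <memory>
#include <type_traits>
#include <utility>
#include <vector>

// Allocator that default-initializes (rather than value-initializes) elements,
// so trivially-constructible types such as double are left uninitialized.
template <typename T, typename A = std::allocator<T>>
class default_init_allocator : public A {
    typedef std::allocator_traits<A> a_t;
public:
    template <typename U>
    struct rebind {
        using other = default_init_allocator<U, typename a_t::template rebind_alloc<U>>;
    };

    using A::A;

    template <typename U>
    void construct(U* ptr) noexcept(std::is_nothrow_default_constructible<U>::value) {
        ::new (static_cast<void*>(ptr)) U;  // default-init: no zero fill
    }
    template <typename U, typename... Args>
    void construct(U* ptr, Args&&... args) {
        a_t::construct(static_cast<A&>(*this), ptr, std::forward<Args>(args)...);
    }
};

// std::vector<double, default_init_allocator<double>> v(1000000);  // elements not zeroed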
What to do about std::complex? It is not defined as a POD. It has a default constructor that spews zeros. So if I write
std::complex<double> A[compile_time_const];
it spews. Ditto for
std::array <std::complex<double>, compile_time_constant>;
What's the best way to utilize the std::complex<> functionality without provoking swarms of zeros?
[Edit] Consider this actual example from a real-valued FFT routine.
{
    cvec Out(N);
    for (int k : range(0, N / 2)) {
        complex Evenk = Even[k];
        complex T = twiddle(k, N, sgn);
        complex Oddk = Odd[k] * T;
        Out[k] = Evenk + Oddk;
        Out[k + N / 2] = Evenk - Oddk; // Note: not in order
    }
    return Out;
}

Efficient coding with memory management

I've recently switched from MATLAB to C++ in order to run simulations faster; however, it still runs slowly. I'm pretty positive there is much to improve in terms of memory usage.
Consider the following code; it shows an example of two array/vector declarations that I use in a simulation.
One has a known fixed length (array01) and the other an unknown length (array02) that changes during the run.
The question here is: what is the best/proper/efficient way of declaring these variables (for both array types) in terms of memory usage and performance?
#include <iostream>
#include <vector>
#include <ctime>
#include <algorithm>
using namespace std;

const int n = 1000;
const int m = 100000;

int main()
{
    srand((unsigned)time(NULL));
    vector<double> array02;
    vector<vector<double>> Array01(n,m);
    for (unsigned int i=0; i<n; i++)
    {
        for (unsigned int j=0; j<m;j++)
        {
            array02.clear();
            rr = rand() % 10;
            for (unsigned int l = 0 ; l<rr <l++)
            {
                array02.pushback(l);
            }
            // perform some calculation with array01 and array02
        }
    }
}
You should consider defining your own Matrix class with a void resize(unsigned width, unsigned height) member function, and a double get(unsigned i, unsigned j) inlined member function and/or a double& at(unsigned i, unsigned j) inlined member function (both giving Mi,j element). The matrix internal data could be a one-dimensional array or vector of doubles. Using a vector of vectors (all of the same size) is not the best (or fastest) way to represent a matrix.
#include <cassert>
#include <vector>

class Matrix {
    std::vector<double> data;
    unsigned width, height;      // width = number of columns, height = number of rows
public:
    Matrix() : data(), width(0), height(0) {};
    ~Matrix() = default;
    /// etc..., see rule of five
    void resize(unsigned w, unsigned h) {
        data.resize(w*h);
        width = w; height = h;
    }
    double get(unsigned i, unsigned j) const {   // element M(i,j): row i, column j
        assert(i < height && j < width);
        return data[i*width + j];
    }
    double& at(unsigned i, unsigned j) {
        assert(i < height && j < width);
        return data[i*width + j];
    }
}; // end class Matrix
Read also about the rule of five.
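For reference, a short usage sketch of the class above (width = number of columns, height = number of rows):
Matrix M;
M.resize(4, 3);          // 4 columns, 3 rows
M.at(2, 1) = 42.0;       // write element at row 2, column 1
double v = M.get(2, 1);  // read it back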
You could also try Scilab (it is free software). It is similar to MATLAB and its performance might differ. Don't forget to use a recent version.
BTW, there are tons of existing C++ numerical libraries dealing with matrices. Consider using one of them. If performance is of paramount importance, don't forget to ask your compiler to optimize your code after you have debugged it.
Assuming you are on Linux (which I recommend for numerical computations; it is significant that most supercomputers run Linux), compile using g++ -std=c++11 -Wall -Wextra -g during the debugging phase, then use g++ -std=c++11 -Wall -Wextra -mtune=native -O3 during benchmarking. Don't forget to profile, and remember that premature optimization is evil (you first need to make your program correct).
You might even spend weeks, months, or perhaps years of work using techniques like OpenMP, OpenCL, MPI, pthreads or std::thread for parallelization (a difficult subject that takes years to master).
If your matrix is big and/or has additional properties (it is sparse, triangular, symmetric, etc.), there is a great deal of mathematical and computer-science knowledge to master in order to improve performance. You could do a PhD on that and spend your entire life on the subject. So go to your university library and read some books on numerical analysis and linear algebra.
For random numbers, C++11 gives you <random>; BTW, use C++11 or C++14, not some earlier version of C++.
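For example, a minimal C++11 sketch replacing rand() % 10:
#include <random>

std::mt19937 gen{std::random_device{}()};       // seeded Mersenne Twister engine
std::uniform_int_distribution<int> dist(0, 9);  // uniform integers in [0, 9]
int r = dist(gen);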
Read also http://floating-point-gui.de/ and a good book about C++ programming.
PS. I don't claim any particular expertise in numerical computation; I much prefer symbolic computation.
First of all, don't try to reinvent the wheel :) Try to use some heavily optimized numerical library, for example:
Intel MKL (Fastest and most used math library for Intel and compatible processors)
LAPACK++ (library for high performance linear algebra)
Boost (not only numerical, but solves almost any problem)
Second: if you need a matrix for a very simple program, use the vector[i + width * j] notation. It's faster than nested vectors because you save the extra memory allocations.
Your example doesn't even compile. I tried to rewrite it a little:
#include <cstdlib>
#include <ctime>
#include <vector>

int main()
{
    const int rowCount = 1000;
    const int columnCount = 1000;

    srand((unsigned)time(nullptr));

    // Declare matrix
    std::vector<double> matrix;

    // Preallocate elements (faster insertion later)
    matrix.reserve(rowCount * columnCount);

    // Insert elements
    for (size_t i = 0; i < rowCount * columnCount; ++i) {
        matrix.push_back(rand() % 10);
    }

    // perform some calculation with matrix
    // For example, this is the matrix element at row 1, column 3:
    double element_1_3 = matrix[3 + 1 * columnCount];

    return EXIT_SUCCESS;
}
Now the speed depends on rand() (which is slow).
As others have said:
Prefer a 1D array to a 2D array for matrices.
Don't reinvent the wheel; use an existing library. Judging from your code, I think the Eigen library suits you best. It also generates very optimized code, since it uses C++ template metaprogramming (expression templates) whenever possible; a minimal sketch follows.
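For illustration, a minimal Eigen sketch (assuming Eigen 3 is installed; the sizes here are arbitrary):
#include <iostream>
#include <Eigen/Dense>

int main()
{
    Eigen::MatrixXd M = Eigen::MatrixXd::Random(1000, 1000); // dense 1000 x 1000 matrix
    M(1, 3) = 5.0;                                           // element access, no manual index arithmetic
    std::cout << M.sum() << '\n';                            // whole-matrix reduction
    return 0;
}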

Using Eigen and C++ to do a colsum of massive matrix product

I am trying to compute colsum(N * P), where N is a sparse, 1M by 2500 matrix, and P is a dense 2500 by 1.5M matrix. I am using the Eigen C++ library with Intel's MKL library. The issue is that the matrix N*P can't actually exist in memory, it's way too big (~10 TB). My question is whether Eigen will be able to handle this computation through some combination of lazy evaluation and parallelism? It says here that Eigen won't make temporary matrices unnecessarily: http://eigen.tuxfamily.org/dox-devel/TopicLazyEvaluation.html
But does Eigen know to compute N * P in piecewise chunks that will actually fit in memory? I.e., it would have to do something like colsum(N * P_1) ++ colsum(N * P_2) ++ ... ++ colsum(N * P_n), where P is split into n different submatrices column-wise and "++" is concatenation.
I am working with 128 GB RAM.
I gave it a try but ended up with a bad malloc (I'm only running with 8 GB on Win8). I set up my main() and used a non-inlined colsum function I wrote.
#include <iostream>
#include <vector>
#include <Eigen/Dense>
#include <Eigen/Sparse>
using namespace Eigen;

int main(int argc, char *argv[])
{
    Eigen::MatrixXd dense = Eigen::MatrixXd::Random(1000, 100000);
    Eigen::SparseMatrix<double> sparse(100000, 1000);
    typedef Triplet<double> Trip;
    std::vector<Trip> trps(dense.rows());
    for (int i = 0; i < dense.rows(); i++)
    {
        trps[i] = Trip(20 * i, i, 2);
    }
    sparse.setFromTriplets(trps.begin(), trps.end());

    VectorXd res = colsum(sparse, dense);
    std::cout << res;
    std::cin >> argc;
    return 0;
}
The attempt was simply:
__declspec(noinline) VectorXd
colsum(const Eigen::SparseMatrix<double> &sparse, const Eigen::MatrixXd &dense)
{
return (sparse * dense).colwise().sum();
}
That had a bad malloc. So it looks like you have to split it up manually on your own (unless someone else has a better solution).
EDIT
I improved the function a bit, but I get the same bad malloc:
__declspec(noinline) VectorXd
colsum(const Eigen::SparseMatrix<double> &sparse, const Eigen::MatrixXd &dense)
{
return (sparse * dense).topRows(4).colwise().sum();
}
EDIT 2
Another option would be to make the sparse matrix dense and force lazy evaluation. I don't think lazyProduct would work with a sparse matrix (oh well).
__declspec(noinline) VectorXd
colsum(const Eigen::SparseMatrix<double> &sparse, const Eigen::MatrixXd &dense)
{
Eigen::MatrixXd denseSparse(sparse);
return denseSparse.lazyProduct(dense).colwise().sum();
}
This doesn't give me the bad malloc, but computes a lot of pointless 0*x_i expressions.
To answer your question: especially when products are involved, Eigen often evaluates parts of expressions into temporaries. In some situations this could be optimized but is not implemented yet; in other cases this is essentially the most efficient way to implement it.
However, in your case you could simply calculate the colsum of N (a 1 x 2500 vector) and multiply that by P.
Maybe future versions of Eigen will be able to make this kind of optimization themselves, but most of the time it is a good idea to make problem-specific optimizations oneself before letting the computer do the rest of the work.
Btw: I'm afraid sparse.colwise() is not implemented yet, so you must compute the column sums of N manually. If you are lazy, you can instead compute Eigen::RowVectorXd Nsum = Eigen::RowVectorXd::Ones(N.rows()) * N; and then Nsum * P gives the result (I have not checked it, but this might actually get optimized to near-optimal code with the most recent versions of Eigen).
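Putting that together, a sketch of the suggested approach (assuming N is an Eigen::SparseMatrix<double> and P an Eigen::MatrixXd; not tested at the 1M x 2500 sizes in the question):
// Column sums of N, computed without sparse.colwise():
Eigen::RowVectorXd Ncolsum = Eigen::RowVectorXd::Ones(N.rows()) * N;  // 1 x 2500
// colsum(N * P) == colsum(N) * P, so the huge product N * P is never materialized:
Eigen::RowVectorXd result = Ncolsum * P;                              // 1 x 1.5M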

OpenCV Mat array access, which way is the fastest for and why?

I am wondering about the ways of accessing data in a Mat in OpenCV. As you know, the data can be accessed in many ways. I want to store an image (Width x Height x 1-depth) in a Mat and loop over each pixel in the image. Is using ptr<>(irow) to get a row pointer and then accessing each column in that row the best way? Or is using at<>(irow, jcol) best? Or is directly calculating the index with index = irow*Width + jcol best? Does anyone know the reasons?
Thanks in advance
You can find information here in the documentation: the basic image container and how to scan images.
I advise you to practice with at (here) if you are not experienced with OpenCV or with C-language type hell. But the fastest way is ptr, as Nolwenn's answer says, because you avoid the type checking.
at<T> does a range check at every call, thus making it slower than ptr<T>, but safer.
So, if you're confident that your range calculations are correct and you want the best possible speed, use ptr<T>.
I realize this is an old question, but I think the current answers are somewhat misleading.
Calling both at<T>(...) and ptr<T>(...) will check the boundaries in debug mode. If the _DEBUG macro is not defined, they basically calculate y * width + x and give you either a pointer to the data or the data itself. So using at<T>(...) in release mode is equivalent to calculating the pointer yourself, but safer, because calculating the pointer is not just y * width + x if the matrix is a sub-view of another matrix. In debug mode, you get the safety checks.
I think the best way is to process the image row by row, getting the row pointer using ptr<T>(y) and then using p[x]. This has the benefit that you don't have to worry about the exact data layout (e.g. row step/padding), and you still have a plain pointer for the inner loop.
You can use plain pointers all the way, which would be most efficient because you avoid one multiplication per row, but then you need to use step1(i) to advance the pointer. I think that using ptr<T>(y) is a nice trade-off.
The official documentation suggests that the most efficient way is to get a pointer to the row first, and then just use the plain C operator []. It also saves a multiplication for each iteration.
// compute sum of positive matrix elements
// (assuming that M is a double-precision matrix)
double sum = 0;
for (int i = 0; i < M.rows; i++)
{
    const double* Mi = M.ptr<double>(i);
    for (int j = 0; j < M.cols; j++)
        sum += std::max(Mi[j], 0.);
}